Databricks Community

kll · ‎04-19-2023

I am curious what is going on under-the-hood when using `multiprocessing` module to parallelize an function call and apply it to a Pandas DataFrame along the row axis.

Specifically, how does it work with DataBricks Architecture / Compute. My cluster configuration is:

2-8 Workers

61-244 GB Memory8-32 Cores1 Driver

30.5 GB Memory, 4 CoresRuntime

12.2.x-scala2.12

For example, here some example code:

import pandas as pd
import requests
from multiprocessing import Pool
 
# Define the API call function
def api_call(row):
    response = requests.get(f'https://api.example.com/?id={row["id"]}')
    return response.json()
 
# Load the data into a Pandas DataFrame
data = pd.read_csv('data.csv')
 
# Define the number of processes to use
num_processes = 4
 
# Create a Pool object to handle the parallel processing
pool = Pool(processes=num_processes)
 
# Apply the API call function to each row of the DataFrame in parallel
results = pool.map(api_call, [row for index, row in data.iterrows()])
 
# Combine the results into a new DataFrame
output = pd.DataFrame(results)
 
# Merge the output DataFrame back into the original DataFrame
data = pd.concat([data, output], axis=1)

I am just trying to understand, what happens under the hood?

https://docs.python.org/3/library/multiprocessing.html

Anonymous · ‎04-20-2023

@Keval Shah :

When using the multiprocessing module in Python to parallelize a function call and apply it to a Pandas DataFrame along the row axis, the following happens under the hood:

The Pool object is created with the specified number of processes.
The input data is split into smaller chunks and distributed across the available processes.
Each process applies the api_call function to its assigned chunk of data in parallel.
The results from each process are collected and combined into a single output.
The output is merged back into the original DataFrame.

In terms of the Databricks architecture, the multiprocessing module works within the context of the Python interpreter running on the driver node. The driver node is responsible for orchestrating the parallel processing of the data across the worker nodes in the cluster. Each worker node runs a separate instance of the Python interpreter and is allocated a portion of the input data to process in parallel. The results are then returned to the driver node, where they are combined and merged back into the original DataFrame.

It's worth noting that Databricks also provides its own parallel processing capabilities through the use of Spark DataFrames and RDDs, which are optimized for distributed computing on large datasets. If you're working with big data, it may be more efficient to use these Spark-based solutions instead of

multiprocessing

.

kll · ‎04-21-2023

@Suteja Kanuri

Thank for the response.

So, given that my DataBricks cluster configuration is 2-8 worker nodes, the driver nodes will allocate the data to be processed in parallel to 4 worker nodes because I have specified `4` in the Pool object? What happens when the `num_processes` I have specified is greater than or less than the available worker nodes.

Would you mind sharing an example of how I'd implement the parallel processing of applying the function along the row axis of a pandas using spark and RDDs?

zzy · ‎08-10-2023

Hi Keval,

Just to confirm that you are saying python multiprocessing will actually use working nodes instead of just the driver node. I thought worker nodes were only used under spark framework so python multiprocessing would only create all the processes in the driver node. Is there any modification to the multiprocessing module in databricks?

Databricks Community

python multiprocessing and the Databricks Architecture - under the hood.

Congratulations Databricks Partners! You're Now Officially Recognized in the Databricks Community

Solution Accelerator Series | Measure Ad Effectiveness With Multi-Touch Attribution

Govern AI Spend at Scale: A Data-Driven Approach to AI Governance | Webinar

Databricks AMER Learning Festival | Virtual Training

Introducing the Genie Hub: Ask Questions, Share Builds, and Master Conversational Analytics