04-19-2023 04:48 PM
I am curious what is going on under the hood when using the `multiprocessing` module to parallelize a function call and apply it to a Pandas DataFrame along the row axis.
Specifically, how does it work with the Databricks architecture / compute? My cluster configuration is:
2-8 Workers (61-244 GB Memory, 8-32 Cores)
1 Driver (30.5 GB Memory, 4 Cores)
Runtime: 12.2.x-scala2.12
For example, here is some example code:
import pandas as pd
import requests
from multiprocessing import Pool

# Define the API call function
def api_call(row):
    response = requests.get(f'https://api.example.com/?id={row["id"]}')
    return response.json()

# Load the data into a Pandas DataFrame
data = pd.read_csv('data.csv')

# Define the number of processes to use
num_processes = 4

# Create a Pool object to handle the parallel processing
pool = Pool(processes=num_processes)

# Apply the API call function to each row of the DataFrame in parallel
results = pool.map(api_call, [row for index, row in data.iterrows()])
pool.close()
pool.join()

# Combine the results into a new DataFrame
output = pd.DataFrame(results)

# Merge the output DataFrame back into the original DataFrame
data = pd.concat([data, output], axis=1)
I am just trying to understand: what happens under the hood?
04-20-2023 07:23 PM
@Keval Shah:
When using the multiprocessing module in Python to parallelize a function call and apply it to a Pandas DataFrame along the row axis, the following happens under the hood:
In terms of the Databricks architecture, the multiprocessing module works within the context of the Python interpreter running on the driver node. The driver node is responsible for orchestrating the parallel processing of the data across the worker nodes in the cluster. Each worker node runs a separate instance of the Python interpreter and is allocated a portion of the input data to process in parallel. The results are then returned to the driver node, where they are combined and merged back into the original DataFrame.
It's worth noting that Databricks also provides its own parallel processing capabilities through the use of Spark DataFrames and RDDs, which are optimized for distributed computing on large datasets. If you're working with big data, it may be more efficient to use these Spark-based solutions instead of multiprocessing.
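For illustration, here is a minimal sketch of what that could look like with an RDD, reusing the hypothetical https://api.example.com endpoint and `id` column from the example above; the CSV path is a placeholder, and it assumes the `spark` SparkSession that Databricks notebooks provide and an API that returns a flat JSON object:

import requests
from pyspark.sql import Row

# Load the data as a Spark DataFrame; Spark partitions the rows across the workers
sdf = spark.read.csv('/path/to/data.csv', header=True)

# Same hypothetical API call as in the original example
def api_call(row):
    response = requests.get(f'https://api.example.com/?id={row["id"]}')
    # Keep the original columns and append the fields returned by the API
    return Row(**{**row.asDict(), **response.json()})

# rdd.map() ships api_call to the executors, so the rows are processed
# in parallel on the worker nodes rather than on the driver
results = spark.createDataFrame(sdf.rdd.map(api_call))
results.show()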
04-21-2023 11:40 AM
@Suteja Kanuri
Thanks for the response.
So, given that my Databricks cluster is configured with 2-8 worker nodes, will the driver node allocate the data to be processed in parallel across 4 worker nodes because I have specified `4` in the Pool object? What happens when the `num_processes` I have specified is greater than or less than the number of available worker nodes?
Would you mind sharing an example of how I'd implement the parallel processing of applying the function along the row axis of a pandas DataFrame using Spark and RDDs?
08-10-2023 01:33 PM
Hi Keval,
Just to confirm: you are saying that Python multiprocessing will actually use the worker nodes instead of just the driver node? I thought worker nodes were only used by the Spark framework, so Python multiprocessing would create all of its processes on the driver node. Is there any modification to the multiprocessing module in Databricks?
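One way to check this empirically is a small sketch like the one below (nothing Databricks-specific is assumed): if multiprocessing stays on the driver, every pool worker will report the same hostname.

import socket
from multiprocessing import Pool

def where_am_i(_):
    # Each pool worker reports the host it is running on
    return socket.gethostname()

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        hosts = pool.map(where_am_i, range(8))
    # A single hostname in the set means all processes ran on one machine
    print(set(hosts))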