Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Python multiprocessing and the Databricks architecture - under the hood.

kll
New Contributor III

I am curious what is going on under the hood when using the `multiprocessing` module to parallelize a function call and apply it to a Pandas DataFrame along the row axis.

Specifically, how does it work with the Databricks architecture/compute? My cluster configuration is:

2-8 Workers: 61-244 GB Memory, 8-32 Cores

1 Driver: 30.5 GB Memory, 4 Cores

Runtime: 12.2.x-scala2.12

For example, here is some example code:

import pandas as pd
import requests
from multiprocessing import Pool

# Define the API call function
def api_call(row):
    response = requests.get(f'https://api.example.com/?id={row["id"]}')
    return response.json()

# Load the data into a Pandas DataFrame
data = pd.read_csv('data.csv')

# Define the number of processes to use
num_processes = 4

# Create a Pool (closed automatically when the block exits) and apply
# the API call function to each row of the DataFrame in parallel
with Pool(processes=num_processes) as pool:
    results = pool.map(api_call, [row for index, row in data.iterrows()])

# Combine the results into a new DataFrame
output = pd.DataFrame(results)

# Merge the output DataFrame back into the original DataFrame (by row position)
data = pd.concat([data, output], axis=1)

I am just trying to understand what happens under the hood.

https://docs.python.org/3/library/multiprocessing.html

3 REPLIES

Anonymous
Not applicable

@Keval Shah:

When using the multiprocessing module in Python to parallelize a function call and apply it to a Pandas DataFrame along the row axis, the following happens under the hood:

  1. The Pool object is created with the specified number of processes.
  2. The input data is split into smaller chunks and distributed across the available processes.
  3. Each process applies the api_call function to its assigned chunk of data in parallel.
  4. The results from each process are collected and combined into a single output.
  5. The output is merged back into the original DataFrame.
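
To make steps 2-4 concrete, here is a minimal standalone sketch (the toy `square` function and the `chunksize=25` value are just for illustration; `chunksize` is the `Pool.map` argument that controls how many items each worker task receives):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map() splits range(100) into chunks of 25 items, hands each chunk to one
        # of the 4 worker processes, and reassembles the results in input order.
        results = pool.map(square, range(100), chunksize=25)
    print(results[:5])  # [0, 1, 4, 9, 16]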

In terms of the Databricks architecture, the multiprocessing module works within the context of the Python interpreter running on the driver node. The driver node is responsible for orchestrating the parallel processing of the data across the worker nodes in the cluster. Each worker node runs a separate instance of the Python interpreter and is allocated a portion of the input data to process in parallel. The results are then returned to the driver node, where they are combined and merged back into the original DataFrame.

It's worth noting that Databricks also provides its own parallel processing capabilities through the use of Spark DataFrames and RDDs, which are optimized for distributed computing on large datasets. If you're working with big data, it may be more efficient to use these Spark-based solutions instead of multiprocessing.
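
As a rough sketch of the Spark-based alternative (assuming a Databricks notebook where `spark` is already defined, the same hypothetical https://api.example.com endpoint from your example, and that the API returns a flat JSON object per id):

import pandas as pd
import requests

# Load the same CSV and convert it to a Spark DataFrame
pdf = pd.read_csv('data.csv')
sdf = spark.createDataFrame(pdf)

def call_api_for_partition(rows):
    # This function runs on the executors (worker nodes), one partition at a time
    for row in rows:
        response = requests.get(f'https://api.example.com/?id={row["id"]}')
        yield {**row.asDict(), **response.json()}

# mapPartitions distributes the API calls across the cluster; collect() brings
# the combined records back to the driver as a list of dicts
results = sdf.rdd.mapPartitions(call_api_for_partition).collect()
output = pd.DataFrame(results)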

kll
New Contributor III

@Suteja Kanuri

Thanks for the response.

So, given that my Databricks cluster configuration is 2-8 worker nodes, will the driver node allocate the data to be processed in parallel across 4 worker nodes because I have specified `4` in the Pool object? What happens when the `num_processes` I have specified is greater than or less than the number of available worker nodes?

Would you mind sharing an example of how I'd implement the parallel processing of applying the function along the row axis of a pandas DataFrame using Spark and RDDs?

zzy
New Contributor III

Hi Keval,

Just to confirm: are you saying Python multiprocessing will actually use the worker nodes instead of just the driver node? I thought worker nodes were only used under the Spark framework, so Python multiprocessing would only create all the processes on the driver node. Is there any modification to the multiprocessing module in Databricks?
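
One quick way to check (just a sketch, using a throwaway `where_am_i` function) is to have each pool worker report where it is running:

import os
import socket
from multiprocessing import Pool

def where_am_i(_):
    # Each pool worker reports the hostname and process id it runs under
    return socket.gethostname(), os.getpid()

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print(set(pool.map(where_am_i, range(8))))
    # If multiprocessing stays on the driver, this prints a single hostname
    # with several distinct pids (one per pool process).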

 
