I am curious what is going on under the hood when using the `multiprocessing` module to parallelize a function call and apply it to a Pandas DataFrame along the row axis.
Specifically, how does it interact with the Databricks architecture/compute? My cluster configuration is:
- Workers: 2-8 (61-244 GB memory, 8-32 cores)
- Driver: 1 (30.5 GB memory, 4 cores)
- Runtime: 12.2.x-scala2.12
Here is some example code:
```python
import pandas as pd
import requests
from multiprocessing import Pool

# Define the API call function
def api_call(row):
    response = requests.get(f'https://api.example.com/?id={row["id"]}')
    return response.json()

# Load the data into a Pandas DataFrame
data = pd.read_csv('data.csv')

# Define the number of processes to use
num_processes = 4

# Create a Pool and apply the API call function to each row in parallel;
# the context manager closes the pool and reaps the child processes
with Pool(processes=num_processes) as pool:
    results = pool.map(api_call, [row for _, row in data.iterrows()])

# Combine the results into a new DataFrame
output = pd.DataFrame(results)

# Merge the output back into the original DataFrame (this relies on both
# frames sharing the default RangeIndex, so rows line up positionally)
data = pd.concat([data, output], axis=1)
```
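To get a feel for where the work actually runs, I also tried a small probe like the one below (the `whereami` helper is just something I wrote for this test); my assumption is that every pool worker reports the same hostname as the driver, i.e. `multiprocessing` never fans out to the cluster's worker nodes:

```python
import os
import socket
from multiprocessing import Pool

def whereami(_):
    # Return the hostname and process ID of the process running this task
    return socket.gethostname(), os.getpid()

with Pool(processes=4) as pool:
    # Distinct PIDs but a single hostname would suggest the pool's
    # child processes all live on the driver node
    print(set(pool.map(whereami, range(8))))
```

If that assumption holds, the parallelism here is capped by the driver's 4 cores and the 2-8 Spark workers are not involved at all, which is part of what I am trying to confirm.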
I am just trying to understand: what happens under the hood?
https://docs.python.org/3/library/multiprocessing.html