
Struggle to parallelize UDF

Dimitry
New Contributor III

Hi all,

I have 2 clusters that look identical, but one runs my UDF in parallel and the other does not.

The one that does is a personal cluster; the one that does not is shared.

import pandas as pd
from datetime import datetime
from time import sleep
import threading

# Test function: sleeps 1 s per group, then records when and on which OS thread it ran
def func(x: pd.DataFrame) -> pd.DataFrame:
    sleep(1)
    return pd.DataFrame({'id': x['id'], 'timestamp': str(datetime.now()), 'thread': threading.get_native_id()})

# 40 rows in 8 partitions; grouping by id gives 40 groups of one row each
sdf = spark.range(start=0, end=40, step=1, numPartitions=8)

start = datetime.now()
sdf = sdf.groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int")
result = spark.createDataFrame(sdf.toPandas())  # toPandas() triggers the lazy evaluation
print((datetime.now() - start).total_seconds())

# Number of groups handled by each thread
display(result.groupBy("thread").count())

The personal cluster spreads the work across 4 threads (one per CPU), but the shared one does not.
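Besides counting thread IDs, the elapsed time printed by the test gives a rough measure of how parallel the run was. The test has 40 groups that each sleep for 1 second, so a serial run should take roughly 40 seconds and a run spread over 4 workers roughly 10. The sketch below is plain Python, the helper names `expected_elapsed` and `effective_parallelism` are hypothetical, and it ignores scheduling and serialization overhead, so treat the results as estimates only.

```python
import math

def expected_elapsed(n_tasks: int, task_seconds: float, workers: int) -> float:
    """Ideal wall-clock time when n_tasks equal tasks run on `workers` slots."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Tasks run in waves of `workers`; the last wave may be partly empty.
    return math.ceil(n_tasks / workers) * task_seconds

def effective_parallelism(n_tasks: int, task_seconds: float, elapsed: float) -> float:
    """Rough number of tasks that ran concurrently, from the measured time."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return (n_tasks * task_seconds) / elapsed

# Figures from the test above: 40 groups, 1 s sleep each.
print(expected_elapsed(40, 1.0, workers=4))   # 10.0
print(expected_elapsed(40, 1.0, workers=1))   # 40.0
# Hypothetical measurement: an elapsed time of 41.5 s suggests about 1 worker.
print(round(effective_parallelism(40, 1.0, elapsed=41.5), 2))  # 0.96
```

An elapsed time close to 40 seconds on the shared cluster would be consistent with all groups running one after another.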

Dimitry_0-1750216264118.png
Dimitry_1-1750216332766.png

 

These are the configurations of the personal and shared clusters. I don't see what makes them behave differently.

Dimitry_3-1750216642622.png

Note that in the real code I use repartition to achieve the same effect, and it also works on the personal cluster but not on the shared one:

_sqldf.repartition(max_number_of_threads, "batch_id").groupBy("batch_id").applyInPandas(..)

Please help!
Accepted Solutions

Dimitry
New Contributor III

I more or less fixed it myself. The screenshot above was incorrect for the shared compute.

Dimitry_1-1750219005650.png

The fix was to change the access mode.

Dimitry_2-1750219043277.png

Dimitry_4-1750219253241.png
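Since the two clusters looked identical in the UI, comparing their settings programmatically may surface a difference such as the access mode more reliably than comparing screenshots. The sketch below diffs two configuration dictionaries in plain Python. The helper `diff_configs` and all keys and values are illustrative assumptions, not the configuration from this thread. In particular, `data_security_mode` is assumed here to be the cluster JSON field that holds the access mode, so check it against your own cluster definitions.

```python
def diff_configs(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    missing = object()  # sentinel for keys present in only one config
    diffs = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key, missing), b.get(key, missing)
        if va != vb:
            diffs[key] = (
                None if va is missing else va,
                None if vb is missing else vb,
            )
    return diffs

# Hypothetical, simplified cluster settings (not taken from the screenshots).
personal = {"num_workers": 0, "node_type_id": "example-node", "data_security_mode": "SINGLE_USER"}
shared = {"num_workers": 0, "node_type_id": "example-node", "data_security_mode": "USER_ISOLATION"}

print(diff_configs(personal, shared))
# {'data_security_mode': ('SINGLE_USER', 'USER_ISOLATION')}
```

Keys that exist in only one of the two configurations are reported with `None` on the side where they are absent.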

 

 

 

 


REPLIES

Dimitry
New Contributor III

As a side note, a "No isolation shared" cluster has no access to Unity Catalog, so table queries do not work there.

I ended up using personal compute assigned to a group.
