Re: Struggle to parallelize UDF

Dimitry · ‎06-17-2025

Hi all

I have two clusters that look identical, but one runs my UDF in parallel and the other does not.

The one that does is personal; the one that doesn't is shared.

import pandas as pd
from datetime import datetime
from time import sleep
import threading

# test function
def func(x: pd.DataFrame):
    sleep(1)
    return pd.DataFrame({'id': x['id'], 'timestamp': str(datetime.now()), 'thread': threading.get_native_id()})

# input: 40 rows across 8 partitions, one group per id
sdf = spark.range(start=0, end=40, step=1, numPartitions=8)

now = datetime.now()
sdf = sdf.groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int")
result = spark.createDataFrame(sdf.toPandas()) # trigger lazy evaluation
print((datetime.now() - now).total_seconds())

display(result.groupBy("thread").count())

The personal cluster spreads the work across 4 threads (one per CPU), but the shared one doesn't.

These are the personal and shared cluster configurations; I don't understand what makes them behave differently.
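For anyone comparing clusters the same way, here is a rough sketch of how the test output could be reduced to a couple of numbers instead of eyeballing the thread counts. The helper name `summarize_parallelism` and its `task_seconds` parameter are made up for illustration; it assumes rows shaped like the test's schema (id, timestamp string, thread id) and that each call sleeps for about one second, as in `func` above.

```python
from collections import Counter
from datetime import datetime


def summarize_parallelism(rows, task_seconds=1.0):
    """Summarize how parallel a run was.

    rows: iterable of (id, timestamp_str, thread_id) tuples, where
    timestamp_str is str(datetime.now()) taken at the end of each call.
    task_seconds: assumed duration of a single call (hypothetical parameter).
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to summarize")

    # How many groups each worker thread handled.
    per_thread = Counter(thread for _, _, thread in rows)

    # Timestamps are taken when a call finishes, so the run lasted roughly
    # (last finish - first finish) plus the duration of the first call.
    stamps = [datetime.fromisoformat(ts) for _, ts, _ in rows]
    wall = (max(stamps) - min(stamps)).total_seconds() + task_seconds

    return {
        "groups": len(rows),
        "distinct_threads": len(per_thread),
        "groups_per_thread": dict(per_thread),
        # Total work divided by elapsed time: about 1.0 means serial.
        "effective_parallelism": round(len(rows) * task_seconds / wall, 2),
    }
```

With the test above it could be called as `summarize_parallelism(sdf.toPandas().itertuples(index=False, name=None))`. An effective parallelism close to 1 with a single distinct thread would match what I see on the shared cluster, while a value near the CPU count would match the personal one.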

Note that in the real code I'm using repartition to achieve the same effect, and it also works on the personal cluster but not on the shared one:

_sqldf.repartition(max_number_of_threads, "batch_id").groupBy("batch_id").applyInPandas(..)

Please help!!

Dimitry · ‎06-17-2025

I sort of fixed it myself. The screenshot above was incorrect for the shared compute, and the fix was to change its access mode.


Dimitry · ‎06-17-2025

As a side note, a "No isolation shared" cluster has no access to Unity Catalog, so no table queries.

I resorted to using personal compute assigned to a group.