Hi all
I have 2 clusters that look identical, but one runs my UDF in parallel and the other does not. The one that does is a personal cluster; the bad one is a shared cluster.
import pandas as pd
from datetime import datetime
from time import sleep
import threading

# test function: sleep 1s per group, then return the ids with a timestamp and the worker thread id
def func(x: pd.DataFrame):
    sleep(1)
    return pd.DataFrame({'id': x['id'],
                         'timestamp': str(datetime.now()),
                         'thread': threading.get_native_id()})

# native
sdf = spark.range(start=0, end=40, step=1, numPartitions=8)
now = datetime.now()
sdf = sdf.groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int")
result = spark.createDataFrame(sdf.toPandas())  # toPandas() forces execution
print((datetime.now() - now).total_seconds())
display(result.groupBy("thread").count())
The personal cluster splits the work across 4 threads (one per CPU), but the shared one doesn't.
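For reference, this is what the expected parallel behavior looks like outside Spark: 4 worker threads chew through 8 one-second groups in about 2 seconds instead of 8. A minimal local sketch (plain Python threads, not Databricks-specific), matching what the personal cluster does:

```python
import pandas as pd
import threading
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from time import sleep

def func(x: pd.DataFrame) -> pd.DataFrame:
    sleep(1)  # simulate 1s of per-group work
    return pd.DataFrame({'id': x['id'],
                         'timestamp': str(datetime.now()),
                         'thread': threading.get_native_id()})

# 8 single-row "groups", processed by a pool of 4 threads
groups = [pd.DataFrame({'id': [i]}) for i in range(8)]
start = datetime.now()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(func, groups))
elapsed = (datetime.now() - start).total_seconds()

threads_used = {r['thread'].iloc[0] for r in results}
print(f"{elapsed:.1f}s wall time, {len(threads_used)} distinct threads")
```

With 4 workers the wall time lands near 2s; run serially (one worker) it would be ~8s, which is the symptom the shared cluster shows.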


Here is the personal vs shared cluster configuration; I don't get what makes them behave differently.

Note that in the real code I use repartition to achieve the same effect; it also works on the personal cluster but not on the shared one.
Please help!!
_sqldf.repartition(max_number_of_threads, "batch_id").groupBy("batch_id").applyInPandas(..)
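One thing worth checking regardless of cluster type (an assumption about the mechanism, not a confirmed diagnosis): repartition(n, col) hash-partitions rows by the column value, so the number of busy tasks is capped by the number of distinct batch_id values, and hash collisions can leave some of the n partitions empty. A pure-Python sketch of the capping effect (Spark uses Murmur3 internally; Python's hash() here only illustrates the idea):

```python
# hash partitioning: each key lands in exactly one partition
def partition_for(key, n_partitions):
    return hash(key) % n_partitions

n_partitions = 4            # stand-in for max_number_of_threads
batch_ids = ["a", "b"]      # only two distinct batch_id values

occupied = {partition_for(b, n_partitions) for b in batch_ids}
# at most len(batch_ids) partitions can hold data,
# so at most 2 applyInPandas tasks can run concurrently here
print(len(occupied))
```

If both clusters see the same data this can't explain the difference by itself, but it's worth ruling out that the shared cluster's query is arriving with fewer distinct batch_id values than expected.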