06-17-2025 08:19 PM
Hi all
I have 2 clusters that look identical, but one runs my UDF in parallel and the other does not.
The one that does is a personal cluster; the one that does not is shared.
import pandas as pd
from datetime import datetime
from time import sleep
import threading

# test function: sleeps 1 s per group and records which worker thread handled it
def func(x: pd.DataFrame):
    sleep(1)
    return pd.DataFrame({'id': x['id'], 'timestamp': str(datetime.now()), 'thread': threading.get_native_id()})

# native
sdf = spark.range(start=0, end=40, step=1, numPartitions=8)
now = datetime.now()
sdf = sdf.groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int")
result = spark.createDataFrame(sdf.toPandas())  # trigger lazy evaluation
print((datetime.now() - now).total_seconds())
display(result.groupBy("thread").count())

The personal cluster splits the work across 4 threads (one per CPU), but the shared one doesn't.
This is the personal vs. shared cluster configuration; I don't understand what makes them behave differently.
Note that in the real code I'm using repartition to achieve the same effect, and it also works on the personal cluster but not on the shared one.
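To compare the two clusters with numbers rather than by eyeballing the thread counts, something like the sketch below could summarize the test output. It is only an illustration: `summarize_parallelism` and its `task_seconds` parameter are hypothetical names, and it assumes each row carries the thread id and the `str(datetime.now())` timestamp written by `func` after its 1 s sleep, so the elapsed time is an estimate.

```python
from collections import Counter
from datetime import datetime


def summarize_parallelism(rows, task_seconds=1.0):
    """Summarize how work was spread across workers.

    rows: iterable of (thread_id, timestamp_str) pairs, where timestamp_str
    is in the str(datetime.now()) format produced by the test function.
    task_seconds: assumed duration of one call (the sleep in the test).
    """
    rows = list(rows)
    per_worker = Counter(thread for thread, _ in rows)
    stamps = [datetime.fromisoformat(ts) for _, ts in rows]
    # Timestamps are taken at the end of each call, so add one task
    # duration to approximate the time from first start to last finish.
    observed = (max(stamps) - min(stamps)).total_seconds() + task_seconds
    serial = len(rows) * task_seconds
    return {
        "workers": len(per_worker),
        "rows_per_worker": dict(per_worker),
        "observed_seconds": observed,
        "serial_seconds": serial,
        "speedup": serial / observed,
    }


if __name__ == "__main__":
    # Made-up sample: two workers, two calls each, finishing 1 s apart.
    sample = [
        (101, "2025-06-17 20:19:01"),
        (102, "2025-06-17 20:19:01"),
        (101, "2025-06-17 20:19:02"),
        (102, "2025-06-17 20:19:02"),
    ]
    print(summarize_parallelism(sample))
```

On the clusters, the input could come from `result.select("thread", "timestamp").collect()`, since each collected row unpacks like a tuple. A speedup close to 1 would suggest the groups ran one after another, while a value near the CPU count would match what the personal cluster shows.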
Please help!!
06-17-2025 09:01 PM
I sort of fixed it myself. The screenshot above was incorrect for the shared compute, and the fix was to change the access mode.
06-17-2025 09:42 PM
As a side note, a "no isolation shared" cluster has no access to Unity Catalog, so table queries don't work there.
I resorted to using personal compute assigned to a group.