Hi all,
Serverless V3 fixed an error about mismatched Python versions between driver and worker that I was hitting on V2 (I can't remember the exact wording), so until now I had been running this on classic compute. Today I tried serverless, with partial success: unfortunately, the UDF is executed on a single CPU/thread.
I use repartition to split the workload across the worker nodes. Here is a mock script demonstrating how I execute the UDF; the mock UDF just sleeps for a second:
import pandas as pd
from datetime import datetime
from time import sleep
import threading

# mock UDF: sleeps one second and records the native id of the thread that ran it
def func(x: pd.DataFrame):
    sleep(1)
    return pd.DataFrame({'id': x['id'], 'timestamp': str(datetime.now()), 'thread': threading.get_native_id()})

# build a 40-row Spark DataFrame from pandas
pdf = pd.DataFrame({'id': range(40)})
sdf = spark.createDataFrame(pdf)

now = datetime.now()
sdf = sdf.repartition(8, "id").groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int")
result = spark.createDataFrame(sdf.toPandas())  # toPandas() triggers the lazy evaluation
print((datetime.now() - now).total_seconds())
# expected at least 4
display(result.groupBy("thread").count())
The script repartitions the 40 ids into 8 partitions, but since my classic compute has only 4 CPUs, the UDF executes on 4 threads:

[display output omitted: four distinct thread ids, together covering all 40 groups]
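For reference, on classic compute I can confirm the partition count directly with a quick check like the one below (my understanding is that RDD APIs such as .rdd are not available on serverless, so this check only works on classic compute):

import pandas as pd

# quick check on classic compute: the explicit repartition really yields 8 partitions
parts = spark.createDataFrame(pd.DataFrame({'id': range(40)})).repartition(8, "id")
print(parts.rdd.getNumPartitions())  # prints 8 on classic compute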
Now, when the same script runs on serverless, I get a single thread:

[display output omitted: a single thread id with count 40]
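My working assumption (not confirmed) is that on serverless the adaptive/auto-optimized shuffle coalesces this tiny 40-row exchange down to a single partition, effectively discarding the explicit repartition. The physical plan should show what actually happens; a diagnostic sketch:

sdf.explain()  # inspect the Exchange nodes to see the partitioning actually applied
# where the runtime exposes it, check the shuffle partition setting too:
try:
    print(spark.conf.get("spark.sql.shuffle.partitions"))
except Exception as e:
    print(f"conf not readable on this runtime: {e}")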
QUESTION: how can I parallelize this job on serverless compute, given that repartitioning obviously does not work here?
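(For comparison only, not what I'm after: an obvious fallback is a driver-side thread pool, which does parallelize the mock but runs entirely on the driver and ignores the cluster. A sketch, reusing func and pdf from the script above:)

from concurrent.futures import ThreadPoolExecutor

# driver-only fallback: run the per-group calls concurrently on driver threads
chunks = [grp for _, grp in pdf.groupby('id')]  # 40 single-row groups
with ThreadPoolExecutor(max_workers=8) as ex:
    out = pd.concat(ex.map(func, chunks), ignore_index=True)
print(out['thread'].nunique())  # number of distinct threads actually used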