Serverless - can't parallelize UDF in applyInPandas

Dimitry — Fri, 17 Oct 2025 06:05:40 GMT

HI all

Serverless V3 solved an error of mismatching python versions between driver and worker which I had on V2 (can't remember the exact wording).

So I'd been running this on classic compute so far.

Today I tried on serverless to a partial success - unfortunately the UDF is being executed on a single CPU / thread.

I use repartition to split the workload to worker nodes.

Here is a mock script to demonstrate how I execute UDF. In the mock UDF just sleeps a second:

import pandas as pd from datetime import datetime from time import sleep import threading # test function def func(x: pd.DataFrame): sleep(1) return pd.DataFrame({'id': x['id'], 'timestamp': str(datetime.now()), 'thread': threading.get_native_id()}) # from pandas pdf = pd.DataFrame({'id': range(40)}) sdf = spark.createDataFrame(pdf) now = datetime.now() sdf = sdf.repartition(8, "id").groupby('id').applyInPandas(func, schema="id int, timestamp string, thread int") result = spark.createDataFrame(sdf.toPandas()) # trigger lazy evaluation print((datetime.now() - now).total_seconds()) # expected at least 4 display(result.groupBy("thread").count())

It repartitions an array 40 into 8, but as classic compute has only 4 CPU, it will execute on 4 threads:

Now, when the same runs on serverless, I'm getting a single thread:

QUESTION: how to parallelize job on serverless compute for my data frame, besides repartitioning (which obviously does not work)?

Re: Serverless - can't parallelize UDF in applyInPandas

Dimitry — Fri, 17 Oct 2025 06:11:33 GMT

I was wrong in interpreting the results.

threading.get_native_id() does not work on serverless as on classic, so different threads return the same ID.

The time it takes to execute the test is obviously less than 40 seconds, if it was running on a single thread.

topic Re: Serverless - can't parallelize UDF in applyInPandas in Data Engineering

Serverless - can't parallelize UDF in applyInPandas

Re: Serverless - can't parallelize UDF in applyInPandas