<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions() in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131695#M49195</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;I have a list of issues not to use serverless, and this is one of them. Currently, most of my jobs use Classic compute.&lt;/P&gt;&lt;P&gt;If I find or hear something from the Serverless team, I will let you know.&lt;/P&gt;</description>
    <pubDate>Thu, 11 Sep 2025 18:00:42 GMT</pubDate>
    <dc:creator>Ramana</dc:creator>
    <dc:date>2025-09-11T18:00:42Z</dc:date>
    <item>
      <title>Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131542#M49128</link>
      <description>&lt;P&gt;&lt;FONT face="courier new,courier"&gt;Hello Community,&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;When we read CSV or JSON files with multiLine=true, the load becomes single-threaded and processes all the data, including all of our custom transformations, in a single task. Unless I repartition after checking the number of partitions in the DataFrame, the job does not execute in parallel.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;In Classic Compute, I read the number of partitions of a DataFrame using &lt;STRONG&gt;rdd.getNumPartitions()&amp;nbsp;&lt;/STRONG&gt;and then call &lt;STRONG&gt;repartition()&lt;/STRONG&gt;.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;When I tried to execute the same code in Serverless, it started failing with the "&lt;FONT color="#FF0000"&gt;pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented&lt;/FONT&gt;" error.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;STRONG&gt;What We’re Looking For:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;We’re trying to find an &lt;STRONG&gt;alternative way to determine the number of partitions&lt;/STRONG&gt; in a DataFrame within serverless compute. This check is critical for us because:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;If the DataFrame has too few partitions, the job execution time increases significantly.&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;We want to avoid blindly repartitioning every DataFrame unless necessary.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;STRONG&gt;Questions for the Community:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;Is there any supported method in serverless compute to inspect or infer the current partition count of a DataFrame?&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;Are there best practices or heuristics others are using to handle this kind of conditional repartitioning in serverless environments?&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;Any guidance, workarounds, or insights would be greatly appreciated!&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#Serverless&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#Compute&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pySpark&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#DataEngineering&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#Migration&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 14:20:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131542#M49128</guid>
      <dc:creator>Ramana</dc:creator>
      <dc:date>2025-09-10T14:20:59Z</dc:date>
    </item>
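The question above asks for heuristics for conditional repartitioning when the current partition count cannot be inspected. One common rule of thumb, sketched below in plain Python (the function name and defaults are hypothetical, not a Databricks API), is to derive a repartition() target from the estimated input size, aiming for roughly 128 MB per partition:

```python
import math

def target_partitions(input_bytes,
                      target_bytes_per_partition=128 * 1024 * 1024,
                      max_partitions=2000):
    """Suggest a repartition() count from an estimated input size.

    Hypothetical heuristic: aim for roughly 128 MB per partition (a
    common Spark rule of thumb), clamped between 1 and max_partitions.
    """
    if input_bytes > 0:
        return max(1, min(max_partitions,
                          math.ceil(input_bytes / target_bytes_per_partition)))
    # Empty or unknown input: a single partition is enough.
    return 1
```

For file-based sources, input_bytes could be estimated by summing file sizes from a storage listing before reading; whether that listing is practical depends on your environment.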
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131558#M49134</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/40873"&gt;@Ramana&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Yep, the RDD API is not supported on Serverless.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1757519217789.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19896i0F83DC3A5AF9CA94/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1757519217789.png" alt="szymon_dybczak_0-1757519217789.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As a workaround, you can obtain the number of partitions in the following way: use spark_partition_id and then count the distinct occurrences of each id.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import spark_partition_id, countDistinct

df = spark.read.table("workspace.default.product_dimension")

display(
    df.withColumn("partitionid", spark_partition_id())
        .select("partitionid")
        .agg(countDistinct("partitionid"))
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 15:48:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131558#M49134</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-10T15:48:56Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131680#M49188</link>
      <description>&lt;P&gt;Thank you&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;for your workaround suggestion.&lt;/P&gt;&lt;P&gt;It is okay as a workaround, but I don't think it is a production-grade solution.&lt;/P&gt;&lt;P&gt;I am looking for a more production-oriented approach, especially for long-running jobs.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 17:06:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131680#M49188</guid>
      <dc:creator>Ramana</dc:creator>
      <dc:date>2025-09-11T17:06:49Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131686#M49193</link>
      <description>&lt;P&gt;Yep, I agree with you that it's not a production-ready workaround. But I don't think you will be able to find a better one either.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Serverless doesn't have access to the RDD API and does not support setting most Spark properties for notebooks or jobs, as you can read here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/spark/conf#configure-spark-properties-for-serverless-notebooks-and-jobs" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/spark/conf#configure-spark-properties-for-serverless-notebooks-and-jobs&lt;/A&gt;&lt;/P&gt;&lt;P&gt;So your options are really limited here. With serverless, the assumption is that Databricks does the optimization for you.&lt;/P&gt;&lt;P&gt;But as we can see from your case, it doesn't always work as expected.&lt;/P&gt;&lt;P&gt;For that particular job, maybe consider using classic compute?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 17:42:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131686#M49193</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-11T17:42:39Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131695#M49195</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;I have a list of issues not to use serverless, and this is one of them. Currently, most of my jobs use Classic compute.&lt;/P&gt;&lt;P&gt;If I find or hear something from the Serverless team, I will let you know.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 18:00:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131695#M49195</guid>
      <dc:creator>Ramana</dc:creator>
      <dc:date>2025-09-11T18:00:42Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131703#M49200</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/40873"&gt;@Ramana&lt;/a&gt;&amp;nbsp;, really appreciate it. This is a really important topic, especially now, when we are being encouraged more and more to migrate our workloads to serverless.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 19:43:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131703#M49200</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-11T19:43:56Z</dc:date>
    </item>
  </channel>
</rss>

