Wednesday
Hello Community,
We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we have run into several challenges, and this is one of them.
When we read CSV or JSON files with multiLine=true, the load becomes single-threaded: all the data, along with all of our custom transformations, gets processed on a single thread. Unless I repartition after checking the number of partitions available in the DataFrame, the process will not execute in parallel.
In Classic Compute, I read the number of partitions of a DataFrame using rdd.getNumPartitions() and then repartition().
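For context, here is a minimal sketch of what that check looks like in Classic Compute (the file path and the target partition count of 16 are placeholders, not our actual job):

# Minimal sketch of the Classic Compute check (path and target count are placeholders)
df = (
    spark.read
    .option("multiLine", "true")
    .option("header", "true")
    .csv("/path/to/multiline/input")
)

# Works on Classic Compute; on Serverless this call raises [NOT_IMPLEMENTED]
current_partitions = df.rdd.getNumPartitions()

# Repartition so the downstream transformations run in parallel
if current_partitions < 16:
    df = df.repartition(16)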
When I tried to execute the same code in Serverless, this started failing with "pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented" error.
What We’re Looking For:
We’re trying to find an alternative way to determine the number of partitions in a DataFrame within serverless compute. This check is critical for us because, without it, we cannot decide when to repartition, and our multiLine reads end up running on a single thread.
Questions for the Community:
Any guidance, workarounds, or insights would be greatly appreciated!
#Serverless
#Compute
#PySpark
#DataEngineering
#Migration
Wednesday
Hi @Ramana ,
Yep, the RDD API is not supported on Serverless.
As a workaround, you can obtain the number of partitions in the following way: use spark_partition_id and then count the distinct occurrences of each id.
from pyspark.sql.functions import spark_partition_id, countDistinct

df = spark.read.table("workspace.default.product_dimension")

display(
    df.withColumn("partitionid", spark_partition_id())
      .select("partitionid")
      .agg(countDistinct("partitionid"))
)
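If you need that number as a plain Python value, for example to decide whether to repartition, you could collect it instead of displaying it. Something like this (the threshold of 16 is just an example):

from pyspark.sql.functions import spark_partition_id, countDistinct

# Collect the distinct partition-id count into a Python int
num_partitions = (
    df.select(spark_partition_id().alias("partitionid"))
      .agg(countDistinct("partitionid").alias("num_partitions"))
      .collect()[0]["num_partitions"]
)

# Example only: repartition if the data ended up in too few partitions
if num_partitions < 16:
    df = df.repartition(16)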
Thursday
Thank you @szymon_dybczak for your workaround suggestion.
As a workaround this is okay, but I don't think it is a production-grade solution. I am looking for a more production-oriented approach, especially for long-running jobs.
Thursday - last edited Thursday
Yep, I agree with you that it's not a production-ready workaround. But I don't think you'll be able to find a better one either.
Serverless doesn't have access to the RDD API and does not support setting most Spark properties for notebooks or jobs, as you can read here:
So your options are really limited here. With serverless the assumption is that the optimization part is done for you by Databricks.
But as your case shows, it doesn't always work as expected.
Maybe for that particular job, consider using classic compute?
Thursday
@szymon_dybczak I have a list of issues that keep us from using serverless, and this is one of them. Currently, most of my jobs use Classic compute.
If I find or hear something from the Serverless team, I will let you know.
Thursday
Thanks @Ramana, really appreciate it. This is a really important topic, especially now, when they're encouraging us more and more to migrate our workloads to serverless.