Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()

Ramana
Valued Contributor

Hello Community,

We have been migrating our jobs from Classic Compute to Serverless Compute. As part of this process, we have run into several challenges, and this is one of them.

When we read CSV or JSON files with multiLine=true, the load becomes single-threaded: Spark processes all the data, along with all of our custom transformations, in a single task. Unless I check the number of partitions in the DataFrame and repartition accordingly, the job does not run in parallel.

In Classic Compute, I read the number of partitions of a DataFrame using rdd.getNumPartitions() and then call repartition().
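
For context, here is a minimal sketch of that Classic Compute pattern (the file path and the target partition count are illustrative, not our actual values):

# Classic Compute pattern: read a multiLine JSON file, inspect the
# partition count via the RDD API, and repartition only when it is too low.
df = (
    spark.read
    .option("multiLine", "true")
    .json("/path/to/raw/json/")  # illustrative path
)

target_partitions = 64  # illustrative target

if df.rdd.getNumPartitions() < target_partitions:  # this call fails on Serverless
    df = df.repartition(target_partitions)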

When I tried to execute the same code in Serverless, this started failing with "pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented" error.

What We’re Looking For:

We’re trying to find an alternative way to determine the number of partitions in a DataFrame within serverless compute. This check is critical for us because:

  • If the DataFrame has too few partitions, the job execution time increases significantly.
  • We want to avoid blindly repartitioning every DataFrame unless it is necessary.

Questions for the Community:

  • Is there any supported method in serverless compute to inspect or infer the current partition count of a DataFrame?
  • Are there best practices or heuristics others are using to handle this kind of conditional repartitioning in serverless environments?

Any guidance, workarounds, or insights would be greatly appreciated!

#Serverless

#Compute

#pySpark

#DataEngineering

#Migration

Thanks
Ramana

szymon_dybczak
Esteemed Contributor III

Hi @Ramana ,

Yep, the RDD API is not supported on Serverless.


As a workaround, you can obtain the number of partitions in the following way: use spark_partition_id() and then count the distinct occurrences of each ID.

from pyspark.sql.functions import spark_partition_id, countDistinct

df = spark.read.table("workspace.default.product_dimension")

# Tag each row with its partition ID, then count the distinct IDs
# to get the number of partitions in the DataFrame.
display(
    df.withColumn("partitionid", spark_partition_id())
      .select("partitionid")
      .agg(countDistinct("partitionid"))
)
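
Wrapped into a small helper, a rough sketch of how this check could drive conditional repartitioning (the helper names and the min_partitions threshold are illustrative, not part of any Databricks API):

from pyspark.sql import DataFrame
from pyspark.sql.functions import spark_partition_id, countDistinct


def current_partition_count(df: DataFrame) -> int:
    # Count distinct spark_partition_id() values; note this triggers a job.
    return (
        df.withColumn("partitionid", spark_partition_id())
          .agg(countDistinct("partitionid").alias("n"))
          .collect()[0]["n"]
    )


def repartition_if_needed(df: DataFrame, min_partitions: int = 32) -> DataFrame:
    # Repartition only when the DataFrame landed in too few partitions.
    if current_partition_count(df) < min_partitions:
        return df.repartition(min_partitions)
    return df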

 

Ramana
Valued Contributor

Thank you @szymon_dybczak for your workaround suggestion.

As a workaround, it is okay to do this, but I don't think this is a PROD solution for long-running jobs.

I am looking for a more production-oriented solution, especially for long-running jobs.

Thanks
Ramana

szymon_dybczak
Esteemed Contributor III

Yep, I agree with you that it's not a production-ready workaround. But I don't think you will find a better one either.

Serverless doesn't have access to the RDD API and does not support setting most Spark properties for notebooks or jobs, as you can read here:

https://docs.databricks.com/aws/en/spark/conf#configure-spark-properties-for-serverless-notebooks-an...

So your options are really limited here. With serverless, the assumption is that the optimization is done for you by Databricks.

But as your case shows, it doesn't always work as expected.

Maybe for that particular job, consider using Classic Compute?

Ramana
Valued Contributor

@szymon_dybczak I have a list of reasons not to use serverless, and this is one of them. Currently, most of my jobs use Classic Compute.

If I find or hear something from the Serverless team, I will let you know.

Thanks
Ramana

szymon_dybczak
Esteemed Contributor III

Thanks @Ramana , really appreciate it. This is a really important topic, especially now, when Databricks is encouraging us more and more to migrate our workloads to serverless.
