Wednesday
Hello Community,
We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we have run into several challenges, and this is one of them.
When we read CSV or JSON files with multiLine=true, the load becomes single-threaded: all the data, along with all of our custom transformations, gets processed on a single thread. Unless I repartition after checking the number of partitions available in the DataFrame, the process will not execute in parallel.
In Classic Compute, I read the number of partitions of a DataFrame using rdd.getNumPartitions() and then repartition().
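For context, here is a minimal sketch of what that check looks like in Classic Compute (the file path and the target partition count of 16 are placeholders, not our actual job):

# Minimal sketch of the Classic Compute check (path and target count are placeholders)
df = (
    spark.read
    .option("multiLine", "true")
    .option("header", "true")
    .csv("/path/to/multiline/input")
)

# Works on Classic Compute; on Serverless this call raises [NOT_IMPLEMENTED]
current_partitions = df.rdd.getNumPartitions()

# Repartition so the downstream transformations run in parallel
if current_partitions < 16:
    df = df.repartition(16)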
When I tried to execute the same code in Serverless, this started failing with "pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented" error.
What We’re Looking For:
We’re trying to find an alternative way to determine the number of partitions in a DataFrame within serverless compute. This check is critical for us because, without it, we cannot decide when to repartition, and our multiLine reads end up running on a single thread.
Questions for the Community:
Any guidance, workarounds, or insights would be greatly appreciated!
#Serverless
#Compute
#PySpark
#DataEngineering
#Migration
Wednesday
Hi @Ramana ,
Yep, the RDD API is not supported on Serverless.
As a workaround, you can obtain the number of partitions in the following way: use spark_partition_id and then count the distinct occurrences of each id.
from pyspark.sql.functions import spark_partition_id, countDistinct

df = spark.read.table("workspace.default.product_dimension")

display(
    df.withColumn("partitionid", spark_partition_id())
      .select("partitionid")
      .agg(countDistinct("partitionid"))
)
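If you need that number as a plain Python value, for example to decide whether to repartition, you could collect it instead of displaying it. Something like this (the threshold of 16 is just an example):

from pyspark.sql.functions import spark_partition_id, countDistinct

# Collect the distinct partition-id count into a Python int
num_partitions = (
    df.select(spark_partition_id().alias("partitionid"))
      .agg(countDistinct("partitionid").alias("num_partitions"))
      .collect()[0]["num_partitions"]
)

# Example only: repartition if the data ended up in too few partitions
if num_partitions < 16:
    df = df.repartition(16)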
Thursday
Thank you @szymon_dybczak for your workaround suggestion.
As a workaround this is okay, but I don't think it is a production-grade solution. I am looking for a more production-oriented approach, especially for long-running jobs.
Thursday - last edited Thursday
Yep, I agree with you that it's not a production-ready workaround. But I don't think you'll be able to find a better one either.
Serverless doesn't have access to the RDD API and does not support setting most Spark properties for notebooks or jobs, as you can read here:
So your options are really limited here. With serverless the assumption is that the optimization part is done for you by Databricks.
But as your case shows, it doesn't always work as expected.
Maybe for that particular job, consider using classic compute?
Thursday
@szymon_dybczak I have a list of issues that keep us from using serverless, and this is one of them. Currently, most of my jobs use Classic compute.
If I find or hear something from the Serverless team, I will let you know.
Thursday
Thanks @Ramana, really appreciate it. This is a really important topic, especially now, when they're encouraging us more and more to migrate our workloads to serverless.