09-10-2025 07:20 AM
Hello Community,
We have been trying to migrate our jobs from Classic Compute to Serverless Compute. We have run into several challenges along the way, and this is one of them.
When we read CSV or JSON files with multiLine=true, the load becomes single-threaded: all of the data, along with all of our custom transformations, is processed in a single thread. Unless I check the number of partitions in the DataFrame and repartition when there are too few, the process does not run in parallel.
In Classic Compute, I read the number of partitions of a DataFrame with rdd.getNumPartitions() and then call repartition().
When I tried to execute the same code in Serverless, it failed with the error "pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented".
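For context, the Classic Compute pattern described above looks roughly like the sketch below. The function name and the `min_partitions` threshold are illustrative and not our exact code; the only calls taken from the description are `rdd.getNumPartitions()` and `repartition()`.

```python
def repartition_if_too_few(df, min_partitions):
    """Repartition df only when it has fewer than min_partitions partitions.

    Illustrative helper (hypothetical name). The df.rdd access is the step
    that raises PySparkNotImplementedError on Serverless Compute.
    """
    current = df.rdd.getNumPartitions()
    if current < min_partitions:
        # Too few partitions: spread the rows so later stages run in parallel.
        return df.repartition(min_partitions)
    # Enough partitions already: return the DataFrame unchanged, no shuffle.
    return df
```

A call such as `repartition_if_too_few(df, 8)` would follow the multiLine read, where `8` is a placeholder threshold.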
What We’re Looking For:
We’re trying to find an alternative way to determine the number of partitions in a DataFrame within serverless compute. This check is critical for us because:
- If the DataFrame has too few partitions, the job execution time increases significantly.
- We want to avoid blindly repartitioning every DataFrame unless necessary.
Questions for the Community:
- Is there any supported method in serverless compute to inspect or infer the current partition count of a DataFrame?
- Are there best practices or heuristics others are using to handle this kind of conditional repartitioning in serverless environments?
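To make the first question concrete, one DataFrame-only candidate is to count the distinct values of `spark_partition_id()` from `pyspark.sql.functions`. The sketch below is untested on Serverless and rests on two assumptions: that this function is available there, and that counting only non-empty partitions is good enough for the check. The helper names and the `minimum` threshold are hypothetical. Note that the count triggers a Spark job, so it is not free.

```python
from typing import Optional


def target_partitions(current: int, minimum: int) -> Optional[int]:
    """Return the partition count to repartition to, or None if no change is needed."""
    if current < minimum:
        return minimum
    return None


def count_nonempty_partitions(df) -> int:
    """Count partitions that hold at least one row, without touching df.rdd."""
    # Imported here so target_partitions stays usable without pyspark installed.
    from pyspark.sql.functions import spark_partition_id

    # Runs a Spark job; partitions with zero rows do not appear in the result.
    return df.select(spark_partition_id().alias("pid")).distinct().count()


def repartition_if_needed(df, minimum: int):
    """Repartition df to `minimum` partitions only when it currently has fewer."""
    target = target_partitions(count_nonempty_partitions(df), minimum)
    return df if target is None else df.repartition(target)
```

Because empty partitions are invisible to this count, it can under-report compared with `rdd.getNumPartitions()`, which may cause an occasional unnecessary repartition.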
Any guidance, workarounds, or insights would be greatly appreciated!
#Serverless
#Compute
#pySPark
#DataEngineering
#Migration
Ramana