<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions() in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131695#M49195</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;I have a list of issues not to use serverless, and this is one of them. Currently, most of my jobs use Classic compute.&lt;/P&gt;&lt;P&gt;If I find or hear something from the Serverless team, I will let you know.&lt;/P&gt;</description>
    <pubDate>Thu, 11 Sep 2025 18:00:42 GMT</pubDate>
    <dc:creator>Ramana</dc:creator>
    <dc:date>2025-09-11T18:00:42Z</dc:date>
    <item>
      <title>Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131542#M49128</link>
      <description>&lt;P&gt;&lt;FONT face="courier new,courier"&gt;Hello Community,&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;When we read CSV or JSON files with multiLine=true, the load becomes single-threaded and processes all the data, including all of our custom transformations, in a single task. Unless I repartition after checking the number of partitions in the DataFrame, the job does not execute in parallel.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;In Classic Compute, I read the number of partitions of a DataFrame using &lt;STRONG&gt;rdd.getNumPartitions()&amp;nbsp;&lt;/STRONG&gt;and then call &lt;STRONG&gt;repartition()&lt;/STRONG&gt;.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;When I tried to execute the same code in Serverless, it started failing with the "&lt;FONT color="#FF0000"&gt;pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented&lt;/FONT&gt;" error.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;STRONG&gt;What We’re Looking For:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;We’re trying to find an &lt;STRONG&gt;alternative way to determine the number of partitions&lt;/STRONG&gt; in a DataFrame within serverless compute. This check is critical for us because:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;If the DataFrame has too few partitions, the job execution time increases significantly.&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;We want to avoid blindly repartitioning every DataFrame unless necessary.&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;STRONG&gt;Questions for the Community:&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;Is there any supported method in serverless compute to inspect or infer the current partition count of a DataFrame?&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="courier new,courier"&gt;Are there best practices or heuristics others are using to handle this kind of conditional repartitioning in serverless environments?&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;Any guidance, workarounds, or insights would be greatly appreciated!&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#Serverless&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#Compute&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#pySpark&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#DataEngineering&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;#Migration&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 14:20:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131542#M49128</guid>
      <dc:creator>Ramana</dc:creator>
      <dc:date>2025-09-10T14:20:59Z</dc:date>
    </item>
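The question above asks for heuristics for conditional repartitioning when the current partition count cannot be inspected. One common rule of thumb, sketched below in plain Python (the function name and defaults are hypothetical, not a Databricks API), is to derive a repartition() target from the estimated input size, aiming for roughly 128 MB per partition:

```python
import math

def target_partitions(input_bytes,
                      target_bytes_per_partition=128 * 1024 * 1024,
                      max_partitions=2000):
    """Suggest a repartition() count from an estimated input size.

    Hypothetical heuristic: aim for roughly 128 MB per partition (a
    common Spark rule of thumb), clamped between 1 and max_partitions.
    """
    if input_bytes > 0:
        return max(1, min(max_partitions,
                          math.ceil(input_bytes / target_bytes_per_partition)))
    # Empty or unknown input: a single partition is enough.
    return 1
```

For file-based sources, input_bytes could be estimated by summing file sizes from a storage listing before reading; whether that listing is practical depends on your environment.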
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131558#M49134</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/40873"&gt;@Ramana&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Yep, the RDD API is not supported on Serverless.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1757519217789.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19896i0F83DC3A5AF9CA94/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1757519217789.png" alt="szymon_dybczak_0-1757519217789.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As a workaround, you can obtain the number of partitions in the following way: use spark_partition_id and then count the distinct occurrences of each id.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import spark_partition_id, countDistinct

df = spark.read.table("workspace.default.product_dimension")

display(
    df.withColumn("partitionid", spark_partition_id())
        .select("partitionid")
        .agg(countDistinct("partitionid"))
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Sep 2025 15:48:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131558#M49134</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-10T15:48:56Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131680#M49188</link>
      <description>&lt;P&gt;Thank you&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;for your workaround suggestion.&lt;/P&gt;&lt;P&gt;It is okay as a workaround, but I don't think it is a production-grade solution.&lt;/P&gt;&lt;P&gt;I am looking for a more production-oriented approach, especially for long-running jobs.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 17:06:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131680#M49188</guid>
      <dc:creator>Ramana</dc:creator>
      <dc:date>2025-09-11T17:06:49Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131686#M49193</link>
      <description>&lt;P&gt;Yep, I agree with you that it's not a production-ready workaround. But I don't think you will be able to find a better one either.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Serverless doesn't have access to the RDD API and does not support setting most Spark properties for notebooks or jobs, as you can read here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/spark/conf#configure-spark-properties-for-serverless-notebooks-and-jobs" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/spark/conf#configure-spark-properties-for-serverless-notebooks-and-jobs&lt;/A&gt;&lt;/P&gt;&lt;P&gt;So your options are really limited here. With serverless, the assumption is that Databricks does the optimization for you.&lt;/P&gt;&lt;P&gt;But as we can see from your case, it doesn't always work as expected.&lt;/P&gt;&lt;P&gt;For that particular job, maybe consider using classic compute?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 17:42:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131686#M49193</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-11T17:42:39Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131695#M49195</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;I have a list of issues not to use serverless, and this is one of them. Currently, most of my jobs use Classic compute.&lt;/P&gt;&lt;P&gt;If I find or hear something from the Serverless team, I will let you know.&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 18:00:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131695#M49195</guid>
      <dc:creator>Ramana</dc:creator>
      <dc:date>2025-09-11T18:00:42Z</dc:date>
    </item>
    <item>
      <title>Re: Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()</title>
      <link>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131703#M49200</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/40873"&gt;@Ramana&lt;/a&gt;&amp;nbsp;, really appreciate it. This is a really important topic, especially now, when we are being encouraged more and more to migrate our workloads to serverless.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Sep 2025 19:43:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/serverless-compute-pyspark-any-alternative-for-rdd/m-p/131703#M49200</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-11T19:43:56Z</dc:date>
    </item>
  </channel>
</rss>

