<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Job performance issue : Configurations in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/job-performance-issue-configurations/m-p/120356#M46145</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101374"&gt;@Vasu_Kumar_T&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is a classic Spark performance inconsistency issue. The fact that the job runs fine in your notebook but degrades after deployment suggests several potential causes. Here are the most likely culprits and solutions:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Primary Suspects&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;1. Data Skew Variations&lt;/STRONG&gt;&lt;BR /&gt;Your salt-based repartitioning might not be consistently effective if the underlying data distribution changes between runs or environments.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2. Cluster Resource Allocation&lt;/STRONG&gt;&lt;BR /&gt;A fixed 20-node cluster doesn't guarantee the same resource allocation as autoscaling.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3. Memory and Executor Configuration&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Environment-Specific Solutions&lt;/STRONG&gt;&lt;BR /&gt;Check these deployment differences:&lt;BR /&gt;- Spark version consistency between notebook and deployment&lt;BR /&gt;- Network bandwidth between nodes in production vs. development&lt;BR /&gt;- Storage type (SSD vs. HDD) and I/O throughput&lt;/P&gt;</description>
    <pubDate>Tue, 27 May 2025 18:00:49 GMT</pubDate>
    <dc:creator>lingareddy_Alva</dc:creator>
    <dc:date>2025-05-27T18:00:49Z</dc:date>
    <item>
      <title>Job performance issue : Configurations</title>
      <link>https://community.databricks.com/t5/data-engineering/job-performance-issue-configurations/m-p/120322#M46138</link>
      <description>&lt;P&gt;Hello All,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One job was taking more than 7 hrs. When we added the configuration below, it took &amp;lt;2:30 mins, but after deployment with the same parameters it is again taking 7+ hrs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;1) spark.conf.set("spark.sql.shuffle.partitions", 500) --&amp;gt; spark.conf.set("spark.sql.shuffle.partitions", 20000)&amp;nbsp;&lt;BR /&gt;2) spark.catalog.clearCache()&lt;BR /&gt;for (id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; rdd.unpersist()&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; print("Unpersisted {} rdd".format(id))&lt;BR /&gt;&lt;BR /&gt;3) DF = DF.withColumn('salt', F.rand())&lt;BR /&gt;DF = DF.repartition(100, 'salt')&lt;/P&gt;&lt;P&gt;&amp;nbsp;We tried a fixed 20 nodes, but it still takes 7+ hrs after deployment (no change in notebook or cluster configuration).&lt;/P&gt;&lt;P&gt;&amp;nbsp;Before deployment, with autoscaling (1:20), it was also taking &amp;lt;2:30 mins.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any suggestions are appreciated. Thanks&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Vasu&lt;/P&gt;</description>
      <pubDate>Tue, 27 May 2025 13:21:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-performance-issue-configurations/m-p/120322#M46138</guid>
      <dc:creator>Vasu_Kumar_T</dc:creator>
      <dc:date>2025-05-27T13:21:35Z</dc:date>
    </item>
    <item>
      <title>Re: Job performance issue : Configurations</title>
      <link>https://community.databricks.com/t5/data-engineering/job-performance-issue-configurations/m-p/120356#M46145</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/101374"&gt;@Vasu_Kumar_T&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is a classic Spark performance inconsistency issue. The fact that the job runs fine in your notebook but degrades after deployment suggests several potential causes. Here are the most likely culprits and solutions:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Primary Suspects&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;1. Data Skew Variations&lt;/STRONG&gt;&lt;BR /&gt;Your salt-based repartitioning might not be consistently effective if the underlying data distribution changes between runs or environments.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2. Cluster Resource Allocation&lt;/STRONG&gt;&lt;BR /&gt;A fixed 20-node cluster doesn't guarantee the same resource allocation as autoscaling.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3. Memory and Executor Configuration&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Environment-Specific Solutions&lt;/STRONG&gt;&lt;BR /&gt;Check these deployment differences:&lt;BR /&gt;- Spark version consistency between notebook and deployment&lt;BR /&gt;- Network bandwidth between nodes in production vs. development&lt;BR /&gt;- Storage type (SSD vs. HDD) and I/O throughput&lt;/P&gt;</description>
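      <content:encoded><![CDATA[
One way to act on the deployment checklist in this reply is to dump the Spark settings in both environments and diff them. The sketch below is only an illustration under stated assumptions: the helper name diff_spark_conf and the sample values are hypothetical, not taken from the thread. It assumes each side is collected as a list of (key, value) pairs, which is the shape PySpark returns from spark.sparkContext.getConf().getAll(). Session-level SQL settings such as spark.sql.shuffle.partitions that were changed with spark.conf.set may not appear in that list, so it is probably worth reading them with spark.conf.get(...) and appending them by hand.

```python
def diff_spark_conf(notebook_conf, job_conf):
    """Return {key: (notebook_value, job_value)} for every setting that differs.

    Each argument is a list of (key, value) pairs.
    A value of None means the key is missing on that side.
    """
    nb = dict(notebook_conf)
    job = dict(job_conf)
    diffs = {}
    # Walk the union of keys so settings present on only one side are reported.
    for key in sorted(set(nb) | set(job)):
        if nb.get(key) != job.get(key):
            diffs[key] = (nb.get(key), job.get(key))
    return diffs


# Hypothetical values for illustration only; in practice, paste the pairs
# captured from the notebook run and from the deployed job run.
notebook_conf = [
    ("spark.sql.shuffle.partitions", "20000"),
    ("spark.sql.adaptive.enabled", "true"),
]
job_conf = [
    ("spark.sql.shuffle.partitions", "500"),
    ("spark.sql.adaptive.enabled", "true"),
    ("spark.executor.memory", "8g"),
]

for key, (nb_value, job_value) in diff_spark_conf(notebook_conf, job_conf).items():
    print(f"{key}: notebook={nb_value} job={job_value}")
```

If the diff comes back empty, configuration drift is a less likely explanation, and the data skew and hardware points above are the next things to look at.
      ]]></content:encoded>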
      <pubDate>Tue, 27 May 2025 18:00:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-performance-issue-configurations/m-p/120356#M46145</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-05-27T18:00:49Z</dc:date>
    </item>
  </channel>
</rss>

