<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/several-unavoidable-for-loops-are-slowing-this-pyspark-code-is/m-p/116780#M45373</link>
    <description>&lt;P&gt;Hi. I have a PySpark notebook that takes 25 minutes to run, as opposed to one minute on on-prem Linux + Pandas. How can I speed it up?&lt;/P&gt;&lt;P&gt;It's not a volume issue. The input is around 30k rows, and the output is the same size because there's no filtering or aggregation; I'm just creating new fields. There are no collect, count, or display statements (which would slow it down).&lt;/P&gt;&lt;P&gt;The main thing is a bunch of mappings I need to apply, but they depend on existing fields, and there are various models I need to run. So the mappings differ by variable and by model. That's where the for loops come in.&lt;/P&gt;&lt;P&gt;Now I'm &lt;EM&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;/EM&gt; iterating over the dataframe itself; just over 15 fields (different variables) and 4 different mappings. Then I do that 10 times (once per model).&lt;/P&gt;&lt;P&gt;The workers are m5d.2xlarge and the driver is r4.2xlarge; min/max workers are 4/20. This should be fine.&lt;/P&gt;&lt;P&gt;I attached a picture to illustrate the code flow. Does anything stand out that you think I could change, or that you think Spark is slow at, such as json.load or create_map?&lt;/P&gt;</description>
    <pubDate>Mon, 28 Apr 2025 13:57:12 GMT</pubDate>
    <dc:creator>397973</dc:creator>
    <dc:date>2025-04-28T13:57:12Z</dc:date>
    <item>
      <title>Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it?</title>
      <link>https://community.databricks.com/t5/data-engineering/several-unavoidable-for-loops-are-slowing-this-pyspark-code-is/m-p/116780#M45373</link>
      <description>&lt;P&gt;Hi. I have a PySpark notebook that takes 25 minutes to run, as opposed to one minute on on-prem Linux + Pandas. How can I speed it up?&lt;/P&gt;&lt;P&gt;It's not a volume issue. The input is around 30k rows, and the output is the same size because there's no filtering or aggregation; I'm just creating new fields. There are no collect, count, or display statements (which would slow it down).&lt;/P&gt;&lt;P&gt;The main thing is a bunch of mappings I need to apply, but they depend on existing fields, and there are various models I need to run. So the mappings differ by variable and by model. That's where the for loops come in.&lt;/P&gt;&lt;P&gt;Now I'm &lt;EM&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;/EM&gt; iterating over the dataframe itself; just over 15 fields (different variables) and 4 different mappings. Then I do that 10 times (once per model).&lt;/P&gt;&lt;P&gt;The workers are m5d.2xlarge and the driver is r4.2xlarge; min/max workers are 4/20. This should be fine.&lt;/P&gt;&lt;P&gt;I attached a picture to illustrate the code flow. Does anything stand out that you think I could change, or that you think Spark is slow at, such as json.load or create_map?&lt;/P&gt;</description>
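      <!-- A hypothetical sketch reconstructing the loop structure described above (the
           attached picture is not included in this feed). The table name, JSON path,
           column names, and mapping layout are assumptions, not the original poster's code.

           import json
           from itertools import chain
           from pyspark.sql import functions as F

           df = spark.table("input_table")                       # assumed ~30k-row input
           variables = [f"var_{i:02d}" for i in range(15)]       # 15 fields (illustrative names)
           models = [f"model_{i}" for i in range(10)]            # 10 models
           mapping_names = ["map_a", "map_b", "map_c", "map_d"]  # 4 mappings

           for model in models:
               for var in variables:
                   for mapping_name in mapping_names:
                       # json.load inside the loop, plus one withColumn per combination:
                       # every withColumn adds a node to the logical plan, so the plan
                       # grows to hundreds of steps and Spark's analysis/optimization
                       # time dominates the tiny amount of actual data work.
                       with open(f"/dbfs/mappings/{model}/{mapping_name}.json") as f:
                           lookup = json.load(f)[var]            # {source_value: mapped_value}
                       m = F.create_map(*[F.lit(x) for x in chain(*lookup.items())])
                       df = df.withColumn(f"{var}_{model}_{mapping_name}", m[F.col(var)])
      -->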
      <pubDate>Mon, 28 Apr 2025 13:57:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/several-unavoidable-for-loops-are-slowing-this-pyspark-code-is/m-p/116780#M45373</guid>
      <dc:creator>397973</dc:creator>
      <dc:date>2025-04-28T13:57:12Z</dc:date>
    </item>
    <item>
      <title>Re: Several unavoidable for loops are slowing this PySpark code. Is it possible to improve it?</title>
      <link>https://community.databricks.com/t5/data-engineering/several-unavoidable-for-loops-are-slowing-this-pyspark-code-is/m-p/116814#M45374</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/46589"&gt;@397973&lt;/a&gt;&lt;/P&gt;&lt;P&gt;Spark is optimized for hundreds of GB or millions of rows, not for small in-memory lookups with heavy control flow (unless engineered carefully).&lt;BR /&gt;That's why Pandas is much faster for your specific case right now.&lt;/P&gt;&lt;P&gt;Pre-load and broadcast all mappings&lt;BR /&gt;Instead of calling json.loads inside the loops on every iteration, load each mapping once up front and broadcast it.&lt;/P&gt;&lt;P&gt;Use a single bulk transformation instead of nested withColumn&lt;BR /&gt;Instead of calling .withColumn inside two nested loops, build all new columns in one transformation: build a list of column expressions first, then apply .selectExpr or .select(*cols) once.&lt;/P&gt;&lt;P&gt;Map via UDF or SQL CASE&lt;BR /&gt;If the mappings are small and fixed, a UDF can be very fast. Or, generate a CASE statement dynamically if the mappings are simple.&lt;/P&gt;&lt;P&gt;Consider pandas_on_spark (Koalas)&lt;BR /&gt;Since your data is tiny (30k rows), maybe don't even use classic PySpark; it can be much faster because it avoids Spark DAG overhead for small data.&lt;/P&gt;</description>
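      <!-- A minimal sketch of the "pre-load mappings + single select" advice above,
           assuming the mappings are plain {source_value: mapped_value} dicts stored in
           one JSON file keyed by model and variable; the path, table, and column names
           are illustrative, not from the original thread.

           import json
           from itertools import chain
           from pyspark.sql import functions as F

           df = spark.table("input_table")                     # assumed ~30k-row input

           # 1. Load every mapping once, outside any loop (no json.loads per iteration).
           with open("/dbfs/mappings/all_mappings.json") as f:
               mappings = json.load(f)                         # {model: {variable: {src: dst}}}

           variables = [f"var_{i:02d}" for i in range(15)]     # illustrative column names

           # 2. Build all new columns as expressions first; nothing executes here.
           new_cols = []
           for model, per_var in mappings.items():
               for var in variables:
                   lookup = per_var[var]
                   m = F.create_map(*[F.lit(x) for x in chain(*lookup.items())])
                   new_cols.append(m[F.col(var)].alias(f"{var}_{model}"))

           # 3. Apply everything in one projection instead of hundreds of chained
           #    withColumn calls, which keeps the logical plan small.
           df_out = df.select("*", *new_cols)

           A dynamic CASE expression (for example via F.when chains or selectExpr) or
           converting the 30k rows to pandas on the driver are the other options the
           reply mentions.
      -->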
      <pubDate>Mon, 28 Apr 2025 16:10:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/several-unavoidable-for-loops-are-slowing-this-pyspark-code-is/m-p/116814#M45374</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-04-28T16:10:55Z</dc:date>
    </item>
  </channel>
</rss>

