<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Pandas API on Spark creates huge query plans in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/48991#M5992</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a piece of code written in both PySpark and the Pandas API on Spark. Comparing the query plans, I see that the Pandas API on Spark generates a huge query plan, whereas the PySpark plan is tiny. Furthermore, with the Pandas API on Spark we see a lot of inconsistencies in the generated results. The PySpark code executes in 4 minutes, while the Pandas API on Spark version takes 16 minutes. Are there known reasons or documented issues for this? I would like to understand why this is happening so that I can address the problem.&lt;/P&gt;</description>
    <pubDate>Thu, 12 Oct 2023 01:21:19 GMT</pubDate>
    <dc:creator>varshanagarajan</dc:creator>
    <dc:date>2023-10-12T01:21:19Z</dc:date>
    <item>
      <title>Pandas API on Spark creates huge query plans</title>
      <link>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/48991#M5992</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a piece of code written in both PySpark and the Pandas API on Spark. Comparing the query plans, I see that the Pandas API on Spark generates a huge query plan, whereas the PySpark plan is tiny. Furthermore, with the Pandas API on Spark we see a lot of inconsistencies in the generated results. The PySpark code executes in 4 minutes, while the Pandas API on Spark version takes 16 minutes. Are there known reasons or documented issues for this? I would like to understand why this is happening so that I can address the problem.&lt;/P&gt;</description>
      <pubDate>Thu, 12 Oct 2023 01:21:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/48991#M5992</guid>
      <dc:creator>varshanagarajan</dc:creator>
      <dc:date>2023-10-12T01:21:19Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas API on Spark creates huge query plans</title>
      <link>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/128285#M10524</link>
      <description>&lt;P&gt;Did you find an answer?&lt;/P&gt;&lt;P&gt;I’ve noticed a similar situation where a simple pyspark.pandas query is far more complex and slower than a PySpark SQL query.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Aug 2025 00:11:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/128285#M10524</guid>
      <dc:creator>FRB1984</dc:creator>
      <dc:date>2025-08-13T00:11:37Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas API on Spark creates huge query plans</title>
      <link>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/128629#M10540</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/164852"&gt;@FRB1984&lt;/a&gt;, could you provide some examples? I'm curious. My first thought is that shuffling is involved. Check this out:&amp;nbsp;&lt;A href="https://spark.apache.org/docs/3.5.4/api/python/user_guide/pandas_on_spark/best_practices.html" target="_blank" rel="noopener"&gt;https://spark.apache.org/docs/3.5.4/api/python/user_guide/pandas_on_spark/best_practices.html&lt;/A&gt;. How the code is written also matters, since that feeds directly into the execution plan.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_0-1755350591093.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19129i522005CAAC31E475/image-size/medium?v=v2&amp;amp;px=400" role="button" title="BS_THE_ANALYST_0-1755350591093.png" alt="BS_THE_ANALYST_0-1755350591093.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;There are some other best practices worth noting on that documentation page:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_1-1755350686312.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19130i51452AEC50178173/image-size/medium?v=v2&amp;amp;px=400" role="button" title="BS_THE_ANALYST_1-1755350686312.png" alt="BS_THE_ANALYST_1-1755350686312.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/164852"&gt;@FRB1984&lt;/a&gt;, have you looked into this:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="BS_THE_ANALYST_2-1755351029710.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19131i2D729B4DB57BB4F7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="BS_THE_ANALYST_2-1755351029710.png" alt="BS_THE_ANALYST_2-1755351029710.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The index type likely contributes to the execution plan being larger. You can change this setting in the options:&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/options.html" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/options.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;Let me know how you get on.&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Sat, 16 Aug 2025 13:33:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/pandas-api-on-spark-creates-huge-query-plans/m-p/128629#M10540</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-08-16T13:33:27Z</dc:date>
    </item>
  </channel>
</rss>

