Pandas API on Spark creates huge query plans
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-11-2023 06:21 PM
Hello,
I have a piece of code written in Pyspark and Pandas API on Spark. On comparing the query plans, I see Pandas API on Spark creates huge query plans whereas Pyspark plan is a tiny one. Furthermore, with Pandas API on spark, we see a lot of inconsistencies in the generated results. Pyspark code executes in 4 mins and Pandas API on Spark takes 16 mins. Do we have reasons or documented issues for these? Would like to know why this is happening so that I can address the problem
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-12-2025 05:11 PM
Did you find an answer?
I’ve noticed a similar situation where a simple pyspark.pandas query if far more complex and slower than a pyspark sql.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-16-2025 06:32 AM - edited 08-16-2025 06:33 AM
@FRB1984 could you provide some examples? I'm curious. My first thoughts would be around the shuffling. Check this out: https://spark.apache.org/docs/3.5.4/api/python/user_guide/pandas_on_spark/best_practices.html . There's an argument to be made about how the code is being written. That'll play into the execution plan, naturally.
There's some other best practices worth noting on that documentation page:
@FRB1984 have you looked into this:
The index type will likely have a contribution and consequence to the execution plan being larger. You can alter this setting in the options: https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/options.html
Let me know how you get on.
All the best,
BS