Re: Pandas API on Spark creates huge query plans

varshanagarajan · ‎10-11-2023

Hello,

I have a piece of code written in Pyspark and Pandas API on Spark. On comparing the query plans, I see Pandas API on Spark creates huge query plans whereas Pyspark plan is a tiny one. Furthermore, with Pandas API on spark, we see a lot of inconsistencies in the generated results. Pyspark code executes in 4 mins and Pandas API on Spark takes 16 mins. Do we have reasons or documented issues for these? Would like to know why this is happening so that I can address the problem

FRB1984 · ‎08-12-2025

Did you find an answer?

I’ve noticed a similar situation where a simple pyspark.pandas query if far more complex and slower than a pyspark sql.

BS_THE_ANALYST · ‎08-16-2025

@FRB1984 could you provide some examples? I'm curious. My first thoughts would be around the shuffling. Check this out: https://spark.apache.org/docs/3.5.4/api/python/user_guide/pandas_on_spark/best_practices.html . There's an argument to be made about how the code is being written. That'll play into the execution plan, naturally.

There's some other best practices worth noting on that documentation page:

@FRB1984 have you looked into this:

The index type will likely have a contribution and consequence to the execution plan being larger. You can alter this setting in the options: https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/options.html

Let me know how you get on.

All the best,
BS