Pandas API on Spark creates huge query plans

varshanagarajan
New Contributor

Hello,

I have a piece of code written in Pyspark and Pandas API on Spark. On comparing the query plans, I see Pandas API on Spark creates huge query plans whereas Pyspark plan is a tiny one. Furthermore, with Pandas API on spark, we see a lot of inconsistencies in the generated results. Pyspark code executes in 4 mins and Pandas API on Spark takes 16 mins. Do we have reasons or documented issues for these? Would like to know why this is happening so that I can address the problem

FRB1984
New Contributor II

Did you find an answer? 

I’ve noticed a similar situation where a simple pyspark.pandas query if far more complex and slower than a pyspark sql. 

BS_THE_ANALYST
Databricks Partner

@FRB1984 could you provide some examples? I'm curious. My first thoughts would be around the shuffling. Check this out: https://spark.apache.org/docs/3.5.4/api/python/user_guide/pandas_on_spark/best_practices.html . There's an argument to be made about how the code is being written. That'll play into the execution plan, naturally.

BS_THE_ANALYST_0-1755350591093.png

There's some other best practices worth noting on that documentation page:

BS_THE_ANALYST_1-1755350686312.png

@FRB1984  have you looked into this: 

BS_THE_ANALYST_2-1755351029710.png

The index type will likely have a contribution and consequence to the execution plan being larger. You can alter this setting in the options: https://spark.apache.org/docs/latest/api/python/tutorial/pandas_on_spark/options.html 

Let me know how you get on.

All the best,
BS