Hello,
I have a piece of code written in both PySpark and the Pandas API on Spark. On comparing the query plans, I see that the Pandas API on Spark produces a huge query plan, whereas the PySpark plan is a tiny one. Furthermore, with the Pandas API on Spark we see a lot of inconsistencies in the generated results. The PySpark code executes in 4 minutes, while the Pandas API on Spark version takes 16 minutes. Are there known reasons or documented issues for this behaviour? I would like to understand why this is happening so that I can address the problem.
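For context, here is a minimal sketch of the kind of comparison I am doing (my real job is more involved; the parquet path and column names below are placeholders). Both versions express the same groupby aggregation, and I inspect the physical plans with explain():

```python
from pyspark.sql import SparkSession, functions as F
import pyspark.pandas as ps

spark = SparkSession.builder.getOrCreate()

# PySpark version: explain() prints a small plan
sdf = spark.read.parquet("/data/events")  # placeholder path
sdf.groupBy("user_id").agg(F.sum("amount").alias("total")).explain()

# Pandas API on Spark version of the same aggregation:
# the plan printed here is far larger
psdf = ps.read_parquet("/data/events")
out = psdf.groupby("user_id", as_index=False)[["amount"]].sum()
out.spark.explain()
```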