Re: spark.sql with CTEs (10 minutes) VS pyspark co...

NandiniN · 04-30-2025

The two versions are not an apples-to-apples comparison, so debugging with the Spark UI and the query plans will give a better understanding of where the difference comes from.

Reviewing the code, I can see that the PySpark implementation explicitly repartitions the DataFrame (repartition("JobApplicationFK")), which distributes the data by that key ahead of the subsequent operations. This can reduce the amount of shuffling during the LEAD() window function and the joins when they partition on the same key. However, you also mentioned that you tested without the repartition. Can you please check that run in the Spark UI? Reviewing the DAG can show you which stages take the most time.

On the other hand, the SQL implementation uses DISTRIBUTE BY, which might not achieve the same level of optimization, depending on the actual execution plan that is generated.
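
As a complement to the Spark UI, you could also compare the physical plans of the two versions directly. The snippet below is only a rough sketch: plan_text and count_shuffles are hypothetical helper names, and it assumes Spark 3.x, where DataFrame.explain(mode="simple") prints the plan to standard output. Counting Exchange lines is a crude proxy for the number of shuffles, not a substitute for reading the plan.

```python
import io
import re
from contextlib import redirect_stdout


def plan_text(df) -> str:
    """Capture the text that df.explain() prints to stdout."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain(mode="simple")
    return buf.getvalue()


def count_shuffles(plan: str) -> int:
    """Rough shuffle count: plan lines containing a standalone 'Exchange'.

    BroadcastExchange and ReusedExchange are not matched, because the
    word-boundary check requires 'Exchange' to start its own word.
    """
    return sum(1 for line in plan.splitlines() if re.search(r"\bExchange\b", line))


# Hypothetical usage; sql_df and pyspark_df stand in for your two versions:
#   sql_df = spark.sql(your_query_with_ctes)
#   pyspark_df = ...  # the DataFrame pipeline with repartition("JobApplicationFK")
#   print(count_shuffles(plan_text(sql_df)), count_shuffles(plan_text(pyspark_df)))
```

If the two counts differ, or the Exchange nodes are partitioned on different columns, that is a good place to start looking in the DAG.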