<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/117280#M45463</link>
    <description>&lt;P&gt;The two snippets are not an apples-to-apples comparison, and debugging with the Spark UI and the query plan will give a better understanding. For example, the SQL version's join to &lt;CODE&gt;jobapplication&lt;/CODE&gt; carries only &lt;CODE&gt;ja.RawDataIsCurrent = 1&lt;/CODE&gt; in its ON clause (no key condition), while the PySpark version joins on &lt;CODE&gt;s.JobApplicationFK = ja.JobApplicationBK&lt;/CODE&gt;, so the two queries do very different amounts of work.&lt;/P&gt;
&lt;P&gt;Reviewing the code, I can see that in the PySpark implementation you explicitly repartition the DataFrame (&lt;CODE&gt;repartition("JobApplicationFK")&lt;/CODE&gt;), which optimizes the data distribution for the subsequent operations and reduces the shuffling during the &lt;CODE&gt;LEAD()&lt;/CODE&gt; window function and the join. Since you mention you also tested without the repartition, please check the Spark UI; reviewing the DAG will show you where the time is actually being spent.&lt;/P&gt;
&lt;P&gt;On the other hand, the SQL implementation uses &lt;CODE&gt;DISTRIBUTE BY&lt;/CODE&gt;, which &lt;STRONG&gt;might&lt;/STRONG&gt; not achieve the same level of optimization, depending on the execution plan that is actually generated.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 01 May 2025 04:12:37 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-05-01T04:12:37Z</dc:date>
    <item>
      <title>spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/116935#M45394</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have two pieces of code with the exact same outcome; one takes 7-10 minutes to run and the other takes exactly 3 seconds, and I'm just trying to understand why:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;This takes 7-10 minutes:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;F_IntakeStepsPerDay = spark.sql("""
WITH BASE AS (
    SELECT
        s.JobApplicationFK,
        s.StepDate,
        ja.CandidateFK
    FROM steps AS s
    INNER JOIN jobapplication AS ja ON ja.RawDataIsCurrent = 1
    WHERE s.RawDataIsCurrent = 1
      AND s.StepDate &amp;gt;= DATE_SUB(CURRENT_DATE(), 35)
),

REPARTITIONED_BASE AS (
    SELECT * FROM BASE DISTRIBUTE BY JobApplicationFK
),

BASE_WITH_NEXT AS (
    SELECT
        b.*,
        LEAD(b.StepDate) OVER (PARTITION BY b.JobApplicationFK ORDER BY b.StepDate) AS NextStepDate
    FROM REPARTITIONED_BASE b
),

JOIN_WITH_DATE AS ( 
    SELECT /*+ RANGE_JOIN(f, 7) */
        f.JobApplicationFK,
        f.CandidateFK,
        f.StepDate,
        f.NextStepDate
    FROM BASE_WITH_NEXT f
    INNER JOIN dim_date d 
      ON d.Date &amp;gt;= f.StepDate 
     AND d.Date &amp;lt; COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
)

SELECT *
FROM JOIN_WITH_DATE 
ORDER BY JobApplicationFK, StepDate
""")

display(F_IntakeStepsPerDay)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;This takes 3 seconds:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col, lead, expr
from pyspark.sql.window import Window

df = (
    spark.table("steps").alias("s")
    .join(spark.table("jobapplication").alias("ja"), on=col("s.JobApplicationFK") == col("ja.JobApplicationBK"))
    .filter(
        (col("s.RawDataIsCurrent") == 1) &amp;amp;
        (col("ja.RawDataIsCurrent") == 1) &amp;amp;
        (col("s.StepDate") &amp;gt;= expr("DATE_SUB(CURRENT_DATE(), 35)"))
    )
    .select("s.JobApplicationFK", "s.StepDate", "ja.CandidateFK")
)

df_repartitioned = df.repartition("JobApplicationFK")

window_spec = Window.partitionBy("JobApplicationFK").orderBy("StepDate")
df_with_next = df_repartitioned.withColumn("NextStepDate", lead("StepDate").over(window_spec))

df_with_next.createOrReplaceTempView("BASE_WITH_NEXT")
spark.catalog.cacheTable("BASE_WITH_NEXT")

F_IntakeStepsPerDay = spark.sql("""
SELECT /*+ RANGE_JOIN(f, 7) */
    d.Date,
    f.JobApplicationFK,
    f.CandidateFK,
    f.StepDate,
    f.NextStepDate
FROM BASE_WITH_NEXT f
INNER JOIN dim_date d ON d.Date &amp;gt;= f.StepDate AND d.Date &amp;lt; COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
ORDER BY f.JobApplicationFK, f.StepDate
""")

F_IntakeStepsPerDay.display()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I just want to understand why.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone explain, please? Just to clarify: even without the repartition in my PySpark code and even without the cache, it still takes 2-3 seconds!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Apr 2025 08:34:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/116935#M45394</guid>
      <dc:creator>biafch</dc:creator>
      <dc:date>2025-04-29T08:34:36Z</dc:date>
    </item>
    <item>
      <title>Re: spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/117280#M45463</link>
      <description>&lt;P&gt;The two snippets are not an apples-to-apples comparison, and debugging with the Spark UI and the query plan will give a better understanding. For example, the SQL version's join to &lt;CODE&gt;jobapplication&lt;/CODE&gt; carries only &lt;CODE&gt;ja.RawDataIsCurrent = 1&lt;/CODE&gt; in its ON clause (no key condition), while the PySpark version joins on &lt;CODE&gt;s.JobApplicationFK = ja.JobApplicationBK&lt;/CODE&gt;, so the two queries do very different amounts of work.&lt;/P&gt;
&lt;P&gt;Reviewing the code, I can see that in the PySpark implementation you explicitly repartition the DataFrame (&lt;CODE&gt;repartition("JobApplicationFK")&lt;/CODE&gt;), which optimizes the data distribution for the subsequent operations and reduces the shuffling during the &lt;CODE&gt;LEAD()&lt;/CODE&gt; window function and the join. Since you mention you also tested without the repartition, please check the Spark UI; reviewing the DAG will show you where the time is actually being spent.&lt;/P&gt;
&lt;P&gt;On the other hand, the SQL implementation uses &lt;CODE&gt;DISTRIBUTE BY&lt;/CODE&gt;, which &lt;STRONG&gt;might&lt;/STRONG&gt; not achieve the same level of optimization, depending on the execution plan that is actually generated.&lt;/P&gt;
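&lt;P&gt;As a rough way to compare the two (just a sketch, reusing the names from your post), you can print the formatted physical plan of each variant and look at the Exchange (shuffle) nodes and the join strategy Spark picked:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch only: both variants in the post end in a DataFrame named
# F_IntakeStepsPerDay, so the formatted physical plan can be printed
# for each run and compared. Watch for Exchange nodes (shuffles) and
# the join type chosen (BroadcastHashJoin / SortMergeJoin /
# BroadcastNestedLoopJoin).
F_IntakeStepsPerDay.explain(mode="formatted")

# The same plan is also available from SQL, e.g. for the cached temp
# view created in the PySpark variant:
spark.sql("EXPLAIN FORMATTED SELECT * FROM BASE_WITH_NEXT").show(truncate=False)&lt;/LI-CODE&gt;
&lt;P&gt;A missing or non-equi join condition usually shows up in the plan as a BroadcastNestedLoopJoin (or a cartesian product) with a very large row count, which would go a long way toward explaining a 10-minute run.&lt;/P&gt;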
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 04:12:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/117280#M45463</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-05-01T04:12:37Z</dc:date>
    </item>
  </channel>
</rss>

