Re: spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why?

biafch · 04-29-2025

Hello,

I have two pieces of code with exactly the same outcome. One takes 7-10 minutes to load and the other takes exactly 3 seconds, and I'm trying to understand why:

This takes 7-10 minutes:

F_IntakeStepsPerDay = spark.sql("""
WITH BASE AS (
    SELECT
        s.JobApplicationFK,
        s.StepDate,
        ja.CandidateFK
    FROM steps AS s
    INNER JOIN jobapplication AS ja ON ja.RawDataIsCurrent = 1
    WHERE s.RawDataIsCurrent = 1
      AND s.StepDate >= DATE_SUB(CURRENT_DATE(), 35)
),

REPARTITIONED_BASE AS (
    SELECT * FROM BASE DISTRIBUTE BY JobApplicationFK
),

BASE_WITH_NEXT AS (
    SELECT
        b.*,
        LEAD(b.StepDate) OVER (PARTITION BY b.JobApplicationFK ORDER BY b.StepDate) AS NextStepDate
    FROM REPARTITIONED_BASE b
),

JOIN_WITH_DATE AS ( 
    SELECT /*+ RANGE_JOIN(f, 7) */
        f.JobApplicationFK,
        f.CandidateFK,
        f.StepDate,
        f.NextStepDate
    FROM BASE_WITH_NEXT f
    INNER JOIN dim_date d 
      ON d.Date >= f.StepDate 
     AND d.Date < COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
)

SELECT *
FROM JOIN_WITH_DATE 
ORDER BY JobApplicationFK, StepDate
""")

display(F_IntakeStepsPerDay)

This takes 3 seconds:

from pyspark.sql.functions import col, lead, expr
from pyspark.sql.window import Window

df = (
    spark.table("steps").alias("s")
    .join(spark.table("jobapplication").alias("ja"), on=col("s.JobApplicationFK") == col("ja.JobApplicationBK"))
    .filter(
        (col("s.RawDataIsCurrent") == 1) &
        (col("ja.RawDataIsCurrent") == 1) &
        (col("s.StepDate") >= expr("DATE_SUB(CURRENT_DATE(), 35)"))
    )
    .select("s.JobApplicationFK", "s.StepDate", "ja.CandidateFK")
)

df_repartitioned = df.repartition("JobApplicationFK")

window_spec = Window.partitionBy("JobApplicationFK").orderBy("StepDate")
df_with_next = df_repartitioned.withColumn("NextStepDate", lead("StepDate").over(window_spec))

df_with_next.createOrReplaceTempView("BASE_WITH_NEXT")
spark.catalog.cacheTable("BASE_WITH_NEXT")

F_IntakeStepsPerDay = spark.sql("""
SELECT /*+ RANGE_JOIN(f, 7) */
    d.Date,
    f.JobApplicationFK,
    f.CandidateFK,
    f.StepDate,
    f.NextStepDate
FROM BASE_WITH_NEXT f
INNER JOIN dim_date d ON d.Date >= f.StepDate AND d.Date < COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
ORDER BY f.JobApplicationFK, f.StepDate
""")

F_IntakeStepsPerDay.display()

I just want to understand why.

Can anyone explain, please? Just to clarify: even without the repartition in my PySpark code, and even without the cache, it still takes 2-3 seconds!

NandiniN · 04-30-2025

The two versions are not an apples-to-apples comparison. Debugging with the Spark UI and the query plan will give a better understanding.
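One concrete difference visible in the posted code is the first join. In the SQL version, the BASE CTE joins jobapplication with only ja.RawDataIsCurrent = 1 in the ON clause, while the PySpark version joins on s.JobApplicationFK == ja.JobApplicationBK. The post does not say whether that is intentional. If it is not, the SQL join pairs every current step with every current job application. The sketch below is plain Python, not Spark, and uses made-up rows purely to show how the two conditions differ in row count:

```python
from itertools import product

# Tiny made-up stand-ins for the two tables (illustrative values only).
steps = [
    {"JobApplicationFK": 1, "RawDataIsCurrent": 1},
    {"JobApplicationFK": 1, "RawDataIsCurrent": 1},
    {"JobApplicationFK": 2, "RawDataIsCurrent": 1},
]
jobapplication = [
    {"JobApplicationBK": 1, "CandidateFK": 10, "RawDataIsCurrent": 1},
    {"JobApplicationBK": 2, "CandidateFK": 20, "RawDataIsCurrent": 1},
    {"JobApplicationBK": 3, "CandidateFK": 30, "RawDataIsCurrent": 1},
]


def join_rows(left, right, on):
    """Naive inner join: keep every (left, right) pair for which on(l, r) is true."""
    return [(l, r) for l, r in product(left, right) if on(l, r)]


# Condition as written in the SQL version: only the flag on jobapplication.
flag_only = join_rows(
    steps,
    jobapplication,
    on=lambda s, ja: ja["RawDataIsCurrent"] == 1,
)

# Condition as written in the PySpark version: match on the application key.
keyed = join_rows(
    steps,
    jobapplication,
    on=lambda s, ja: s["JobApplicationFK"] == ja["JobApplicationBK"],
)

print(len(flag_only))  # 9: every step paired with every job application
print(len(keyed))      # 3: each step paired with its own job application
```

With real table sizes the flag-only join grows as the product of the two row counts, and every later stage (the window, the range join against dim_date, the final sort) has to process that larger input.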

Reviewing the code, I can see that the PySpark implementation explicitly repartitions the DataFrame (repartition("JobApplicationFK")), which optimizes the data distribution for subsequent operations and reduces shuffling during the LEAD() window function and the join. However, you also mentioned that you tested without the repartition, so please check the Spark UI: reviewing the DAG will show where most of the time is spent.

On the other hand, the SQL implementation uses DISTRIBUTE BY, which might not achieve the same level of optimization depending on the actual execution plan generated.
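Because the outcome depends on the plan Spark actually generates, it can help to pull the physical plan of each version into the notebook and compare them side by side. The sketch below is one way to do that, not an official recipe. The helper names plan_text and join_operators are hypothetical, and it assumes that DataFrame.explain prints through Python's standard output, as it does in open-source PySpark:

```python
import contextlib
import io

# Physical join operators that Spark prints in its plans.
JOIN_OPERATORS = (
    "BroadcastHashJoin",
    "SortMergeJoin",
    "ShuffledHashJoin",
    "BroadcastNestedLoopJoin",
    "CartesianProduct",
)


def plan_text(df):
    """Capture what df.explain() prints, so two plans can be compared as strings."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain(mode="formatted")  # mode= is available from Spark 3.0
    return buffer.getvalue()


def join_operators(plan):
    """Return the join operator names that occur in a plan string."""
    return [name for name in JOIN_OPERATORS if name in plan]


# Usage in a notebook, where F_IntakeStepsPerDay is the DataFrame
# produced by either version of the query:
# print(join_operators(plan_text(F_IntakeStepsPerDay)))
```

BroadcastNestedLoopJoin and CartesianProduct are the operators Spark falls back to when a join has no equality condition. Both versions contain the range join against dim_date, which has no equality condition either, so a non-equi join operator may appear in each plan. The thing to look for is whether the SQL version shows an additional one for the steps/jobapplication join.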
