spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why?
04-29-2025 01:34 AM
Hello,
I have two pieces of code that produce the exact same result. One takes 7-10 minutes to run and the other takes about 3 seconds, and I'm trying to understand why:
This takes 7-10 minutes:
F_IntakeStepsPerDay = spark.sql("""
    WITH BASE AS (
        SELECT
            s.JobApplicationFK,
            s.StepDate,
            ja.CandidateFK
        FROM steps AS s
        INNER JOIN jobapplication AS ja ON ja.RawDataIsCurrent = 1
        WHERE s.RawDataIsCurrent = 1
          AND s.StepDate >= DATE_SUB(CURRENT_DATE(), 35)
    ),
    REPARTITIONED_BASE AS (
        SELECT * FROM BASE DISTRIBUTE BY JobApplicationFK
    ),
    BASE_WITH_NEXT AS (
        SELECT
            b.*,
            LEAD(b.StepDate) OVER (PARTITION BY b.JobApplicationFK ORDER BY b.StepDate) AS NextStepDate
        FROM REPARTITIONED_BASE b
    ),
    JOIN_WITH_DATE AS (
        SELECT /*+ RANGE_JOIN(f, 7) */
            f.JobApplicationFK,
            f.CandidateFK,
            f.StepDate,
            f.NextStepDate
        FROM BASE_WITH_NEXT f
        INNER JOIN dim_date d
            ON d.Date >= f.StepDate
           AND d.Date < COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
    )
    SELECT *
    FROM JOIN_WITH_DATE
    ORDER BY JobApplicationFK, StepDate
""")
display(F_IntakeStepsPerDay)
This takes 3 seconds:
from pyspark.sql.functions import col, lead, expr
from pyspark.sql.window import Window

df = (
    spark.table("steps").alias("s")
    .join(
        spark.table("jobapplication").alias("ja"),
        on=col("s.JobApplicationFK") == col("ja.JobApplicationBK"),
    )
    .filter(
        (col("s.RawDataIsCurrent") == 1) &
        (col("ja.RawDataIsCurrent") == 1) &
        (col("s.StepDate") >= expr("DATE_SUB(CURRENT_DATE(), 35)"))
    )
    .select("s.JobApplicationFK", "s.StepDate", "ja.CandidateFK")
)

df_repartitioned = df.repartition("JobApplicationFK")
window_spec = Window.partitionBy("JobApplicationFK").orderBy("StepDate")
df_with_next = df_repartitioned.withColumn("NextStepDate", lead("StepDate").over(window_spec))
df_with_next.createOrReplaceTempView("BASE_WITH_NEXT")
spark.catalog.cacheTable("BASE_WITH_NEXT")

F_IntakeStepsPerDay = spark.sql("""
    SELECT /*+ RANGE_JOIN(f, 7) */
        d.Date,
        f.JobApplicationFK,
        f.CandidateFK,
        f.StepDate,
        f.NextStepDate
    FROM BASE_WITH_NEXT f
    INNER JOIN dim_date d
        ON d.Date >= f.StepDate
       AND d.Date < COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
    ORDER BY f.JobApplicationFK, f.StepDate
""")
F_IntakeStepsPerDay.display()
I just want to understand why.
Can anyone explain, please? Just to clarify: even without the repartition in my PySpark code, and even without the cache, it still takes 2-3 seconds!
04-30-2025 09:12 PM
The two snippets are not an apples-to-apples comparison, so debugging with the Spark UI and the query plans will give a better understanding.
From reviewing the code, I can see that in the PySpark implementation you explicitly repartition the DataFrame (repartition("JobApplicationFK")), which helps optimize the data distribution for subsequent operations. This can reduce the amount of shuffling during the LEAD() window function and the join. However, you also mentioned that you tested without the repartition, so could you please check the Spark UI? Reviewing the DAG will show you where most of the time is spent.
On the other hand, the SQL implementation uses DISTRIBUTE BY, which might not achieve the same level of optimization, depending on the execution plan that is actually generated.
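One concrete difference to check alongside the Spark UI: in the SQL version, steps is joined to jobapplication with only `ON ja.RawDataIsCurrent = 1`, while the PySpark version joins on `s.JobApplicationFK == ja.JobApplicationBK`. (The SQL version's final SELECT also leaves out `d.Date`, which the PySpark version returns.) The sketch below shows how much the join condition alone can change the row count. It is an illustration under assumptions: the rows are made-up toy data, and sqlite3 is used only because it ships with Python, so it says nothing about Spark's optimizer or timings.

```python
import sqlite3

# Hypothetical toy data: 3 current steps and 4 current job applications.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE steps (JobApplicationFK INTEGER, StepDate TEXT, RawDataIsCurrent INTEGER);
    CREATE TABLE jobapplication (JobApplicationBK INTEGER, CandidateFK INTEGER, RawDataIsCurrent INTEGER);
    INSERT INTO steps VALUES (1, '2025-04-01', 1), (1, '2025-04-03', 1), (2, '2025-04-02', 1);
    INSERT INTO jobapplication VALUES (1, 10, 1), (2, 20, 1), (3, 30, 1), (4, 40, 1);
""")


def count_rows(join_condition: str) -> int:
    """Count the rows produced by joining steps to jobapplication on the given condition."""
    query = f"""
        SELECT COUNT(*)
        FROM steps AS s
        INNER JOIN jobapplication AS ja ON {join_condition}
        WHERE s.RawDataIsCurrent = 1
    """
    return conn.execute(query).fetchone()[0]


# Join condition as written in the SQL version: a filter only, no key equality.
rows_without_key = count_rows("ja.RawDataIsCurrent = 1")

# Join condition as written in the PySpark version, plus the same filter.
rows_with_key = count_rows(
    "s.JobApplicationFK = ja.JobApplicationBK AND ja.RawDataIsCurrent = 1"
)

print(rows_without_key)  # 12: every step paired with every current application (3 x 4)
print(rows_with_key)     # 3: each step matched to its own application
```

If the real tables behave the same way, the SQL version would feed far more rows into the LEAD() window and the range join than the PySpark version does. That is an assumption to verify against the actual data, for example by comparing the row counts of both results and by adding the key equality to the SQL join, not a confirmed diagnosis.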