<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/117280#M45463</link>
    <description>&lt;P&gt;The two snippets are not an apples-to-apples comparison, and debugging with the Spark UI and the query plan will give a better understanding. For example, the SQL version's join to &lt;CODE&gt;jobapplication&lt;/CODE&gt; carries only &lt;CODE&gt;ja.RawDataIsCurrent = 1&lt;/CODE&gt; in its ON clause (no key condition), while the PySpark version joins on &lt;CODE&gt;s.JobApplicationFK = ja.JobApplicationBK&lt;/CODE&gt;, so the two queries do very different amounts of work.&lt;/P&gt;
&lt;P&gt;Reviewing the code, I can see that in the PySpark implementation you explicitly repartition the DataFrame (&lt;CODE&gt;repartition("JobApplicationFK")&lt;/CODE&gt;), which optimizes the data distribution for the subsequent operations and reduces the shuffling during the &lt;CODE&gt;LEAD()&lt;/CODE&gt; window function and the join. Since you mention you also tested without the repartition, please check the Spark UI; reviewing the DAG will show you where the time is actually being spent.&lt;/P&gt;
&lt;P&gt;On the other hand, the SQL implementation uses &lt;CODE&gt;DISTRIBUTE BY&lt;/CODE&gt;, which &lt;STRONG&gt;might&lt;/STRONG&gt; not achieve the same level of optimization, depending on the execution plan that is actually generated.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 01 May 2025 04:12:37 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-05-01T04:12:37Z</dc:date>
    <item>
      <title>spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/116935#M45394</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have two pieces of code with the exact same outcome; one takes 7-10 minutes to run and the other takes exactly 3 seconds, and I'm just trying to understand why:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;This takes 7-10 minutes:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;F_IntakeStepsPerDay = spark.sql("""
WITH BASE AS (
    SELECT
        s.JobApplicationFK,
        s.StepDate,
        ja.CandidateFK
    FROM steps AS s
    INNER JOIN jobapplication AS ja ON ja.RawDataIsCurrent = 1
    WHERE s.RawDataIsCurrent = 1
      AND s.StepDate &amp;gt;= DATE_SUB(CURRENT_DATE(), 35)
),

REPARTITIONED_BASE AS (
    SELECT * FROM BASE DISTRIBUTE BY JobApplicationFK
),

BASE_WITH_NEXT AS (
    SELECT
        b.*,
        LEAD(b.StepDate) OVER (PARTITION BY b.JobApplicationFK ORDER BY b.StepDate) AS NextStepDate
    FROM REPARTITIONED_BASE b
),

JOIN_WITH_DATE AS ( 
    SELECT /*+ RANGE_JOIN(f, 7) */
        f.JobApplicationFK,
        f.CandidateFK,
        f.StepDate,
        f.NextStepDate
    FROM BASE_WITH_NEXT f
    INNER JOIN dim_date d 
      ON d.Date &amp;gt;= f.StepDate 
     AND d.Date &amp;lt; COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
)

SELECT *
FROM JOIN_WITH_DATE 
ORDER BY JobApplicationFK, StepDate
""")

display(F_IntakeStepsPerDay)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;This takes 3 seconds:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import col, lead, expr
from pyspark.sql.window import Window

df = (
    spark.table("steps").alias("s")
    .join(spark.table("jobapplication").alias("ja"), on=col("s.JobApplicationFK") == col("ja.JobApplicationBK"))
    .filter(
        (col("s.RawDataIsCurrent") == 1) &amp;amp;
        (col("ja.RawDataIsCurrent") == 1) &amp;amp;
        (col("s.StepDate") &amp;gt;= expr("DATE_SUB(CURRENT_DATE(), 35)"))
    )
    .select("s.JobApplicationFK", "s.StepDate", "ja.CandidateFK")
)

df_repartitioned = df.repartition("JobApplicationFK")

window_spec = Window.partitionBy("JobApplicationFK").orderBy("StepDate")
df_with_next = df_repartitioned.withColumn("NextStepDate", lead("StepDate").over(window_spec))

df_with_next.createOrReplaceTempView("BASE_WITH_NEXT")
spark.catalog.cacheTable("BASE_WITH_NEXT")

F_IntakeStepsPerDay = spark.sql("""
SELECT /*+ RANGE_JOIN(f, 7) */
    d.Date,
    f.JobApplicationFK,
    f.CandidateFK,
    f.StepDate,
    f.NextStepDate
FROM BASE_WITH_NEXT f
INNER JOIN dim_date d ON d.Date &amp;gt;= f.StepDate AND d.Date &amp;lt; COALESCE(f.NextStepDate, DATE_ADD(f.StepDate, 1))
ORDER BY f.JobApplicationFK, f.StepDate
""")

F_IntakeStepsPerDay.display()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I just want to understand why.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can anyone explain, please? Just to clarify: even without the repartition in my PySpark code and even without the cache, it still takes 2-3 seconds!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Apr 2025 08:34:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/116935#M45394</guid>
      <dc:creator>biafch</dc:creator>
      <dc:date>2025-04-29T08:34:36Z</dc:date>
    </item>
    <item>
      <title>Re: spark.sql with CTEs (10 minutes) VS pyspark code + spark.sql (without CTE) (3 seconds), why?</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/117280#M45463</link>
      <description>&lt;P&gt;The two snippets are not an apples-to-apples comparison, and debugging with the Spark UI and the query plan will give a better understanding. For example, the SQL version's join to &lt;CODE&gt;jobapplication&lt;/CODE&gt; carries only &lt;CODE&gt;ja.RawDataIsCurrent = 1&lt;/CODE&gt; in its ON clause (no key condition), while the PySpark version joins on &lt;CODE&gt;s.JobApplicationFK = ja.JobApplicationBK&lt;/CODE&gt;, so the two queries do very different amounts of work.&lt;/P&gt;
&lt;P&gt;Reviewing the code, I can see that in the PySpark implementation you explicitly repartition the DataFrame (&lt;CODE&gt;repartition("JobApplicationFK")&lt;/CODE&gt;), which optimizes the data distribution for the subsequent operations and reduces the shuffling during the &lt;CODE&gt;LEAD()&lt;/CODE&gt; window function and the join. Since you mention you also tested without the repartition, please check the Spark UI; reviewing the DAG will show you where the time is actually being spent.&lt;/P&gt;
&lt;P&gt;On the other hand, the SQL implementation uses &lt;CODE&gt;DISTRIBUTE BY&lt;/CODE&gt;, which &lt;STRONG&gt;might&lt;/STRONG&gt; not achieve the same level of optimization, depending on the execution plan that is actually generated.&lt;/P&gt;
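&lt;P&gt;As a rough way to compare the two (just a sketch, reusing the names from your post), you can print the formatted physical plan of each variant and look at the Exchange (shuffle) nodes and the join strategy Spark picked:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch only: both variants in the post end in a DataFrame named
# F_IntakeStepsPerDay, so the formatted physical plan can be printed
# for each run and compared. Watch for Exchange nodes (shuffles) and
# the join type chosen (BroadcastHashJoin / SortMergeJoin /
# BroadcastNestedLoopJoin).
F_IntakeStepsPerDay.explain(mode="formatted")

# The same plan is also available from SQL, e.g. for the cached temp
# view created in the PySpark variant:
spark.sql("EXPLAIN FORMATTED SELECT * FROM BASE_WITH_NEXT").show(truncate=False)&lt;/LI-CODE&gt;
&lt;P&gt;A missing or non-equi join condition usually shows up in the plan as a BroadcastNestedLoopJoin (or a cartesian product) with a very large row count, which would go a long way toward explaining a 10-minute run.&lt;/P&gt;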
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 04:12:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-sql-with-ctes-10-minutes-vs-pyspark-code-spark-sql-without/m-p/117280#M45463</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-05-01T04:12:37Z</dc:date>
    </item>
  </channel>
</rss>

