<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks optimization for query performance and pipeline run in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-optimization-for-query-perfomance-and-pipeline-run/m-p/153062#M53925</link>
    <description>&lt;P&gt;Practitioner-level reply from lingareddy_Alva covering the Spark Catalyst Optimizer, the Tungsten engine, Adaptive Query Execution, and related architectural levers for high-volume ETL. Full text in the thread item below.&lt;/P&gt;</description>
    <pubDate>Thu, 02 Apr 2026 15:10:01 GMT</pubDate>
    <dc:creator>lingareddy_Alva</dc:creator>
    <dc:date>2026-04-02T15:10:01Z</dc:date>
    <item>
      <title>Databricks optimization for query performance and pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-optimization-for-query-perfomance-and-pipeline-run/m-p/153057#M53921</link>
      <description>&lt;P&gt;I am currently working on optimizing several Spark pipelines and wanted to gather community insights on advanced performance tuning. Typically, my workflow for traditional SQL optimization involves a deep dive into the &lt;STRONG&gt;execution plan&lt;/STRONG&gt; to identify bottlenecks and develop a targeted optimization strategy. Within Spark, I&#8217;ve focused heavily on &lt;STRONG&gt;pipeline optimization&lt;/STRONG&gt;, specifically addressing join performance through &lt;STRONG&gt;broadcast joins&lt;/STRONG&gt; and mitigating shuffle delays.&lt;/P&gt;&lt;P&gt;I am interested in moving beyond these standard techniques. Specifically, I&#8217;ve been hearing from other developers about the &lt;STRONG&gt;Spark Catalyst Optimizer&lt;/STRONG&gt; and the &lt;STRONG&gt;Tungsten Engine&lt;/STRONG&gt;. I&#8217;m curious whether anyone has practical experience or "under-the-hood" tips on how to better leverage these engines, or other architectural optimizations (like Adaptive Query Execution) that you&#8217;ve found effective for high-volume ETL.&lt;/P&gt;</description>
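Since the question centers on reading execution plans, here is a small, self-contained sketch of the triage step. The plan excerpt is hand-written and abridged purely for illustration; in a real session you would capture the text from df.explain() or EXPLAIN FORMATTED. The operator names checked (BroadcastHashJoin, SortMergeJoin, PushedFilters, the *(n) codegen prefix) are the standard Spark physical-plan markers:

```python
def plan_markers(plan_text: str) -> dict:
    """Scan an explain() dump for the operators worth checking first."""
    return {
        # SortMergeJoin against a small dimension side usually means the
        # broadcast threshold or table statistics are off.
        "broadcast_join": "BroadcastHashJoin" in plan_text,
        "sort_merge_join": "SortMergeJoin" in plan_text,
        # Filters actually pushed into the scan appear as PushedFilters: [...]
        "pushed_filters": "PushedFilters" in plan_text,
        # Whole-stage-codegen stages are prefixed with *(n)
        "wholestage_codegen": "*(" in plan_text,
    }

# Abridged, hand-written plan excerpt, purely for demonstration:
sample = """
*(2) BroadcastHashJoin [id#1], [id#7], Inner, BuildRight
+- *(2) Filter isnotnull(id#1)
   +- FileScan parquet [id#1,amount#2] PushedFilters: [IsNotNull(id)]
"""
print(plan_markers(sample))
```

Grepping a plan string like this is crude next to the Spark UI's SQL tab, but it is handy for asserting on plan shape in pipeline tests.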
      <pubDate>Thu, 02 Apr 2026 14:38:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-optimization-for-query-perfomance-and-pipeline-run/m-p/153057#M53921</guid>
      <dc:creator>sai_sakhamuri</dc:creator>
      <dc:date>2026-04-02T14:38:41Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks optimization for query performance and pipeline run</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-optimization-for-query-perfomance-and-pipeline-run/m-p/153062#M53925</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/190062"&gt;@sai_sakhamuri&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;You're clearly past the basics. Let me give you a practitioner-level breakdown of each layer you mentioned, plus a few things that often get overlooked.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Spark Catalyst Optimizer — Working With the Rules Engine&lt;/STRONG&gt;&lt;BR /&gt;Catalyst operates in four phases: Analysis → Logical Optimization → Physical Planning → Code Generation. Most developers only think about the physical plan, but the biggest leverage is earlier.&lt;BR /&gt;&lt;STRONG&gt;Practical tips:&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;Predicate pushdown is not always automatic.&lt;/STRONG&gt; When reading from JDBC sources or custom data source V2 connectors, Catalyst can't always push filters down. Explicitly filter before any joins/aggregations and verify with df.explain("extended") that predicates appear in the PushedFilters section.&lt;BR /&gt;&lt;STRONG&gt;Column pruning breaks with select *&lt;/STRONG&gt;. If you use SELECT * anywhere upstream, Catalyst cannot prune unused columns from file reads. Always project only what you need, as early as possible.&lt;BR /&gt;&lt;STRONG&gt;Avoid UDFs when a native expression exists&lt;/STRONG&gt;. Catalyst cannot optimize inside a UDF — it's a black box. Even simple UDFs defeat predicate pushdown on the columns they touch. Prefer expr(), when/otherwise, and functions.* equivalents.&lt;BR /&gt;&lt;STRONG&gt;Use&lt;/STRONG&gt; explain("cost") (Spark 3+) to see the row count and size estimates Catalyst is working with. If its estimates are wildly off, your statistics are stale — ANALYZE TABLE ... 
COMPUTE STATISTICS FOR COLUMNS fixes this for Hive metastore tables.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Tungsten Engine — The Binary/Codegen Layer&lt;/STRONG&gt;&lt;BR /&gt;Tungsten is responsible for off-heap memory management and whole-stage code generation (WSCG). It compiles operator chains into a single tight Java bytecode loop, eliminating virtual dispatch and intermediate object allocation.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Practical tips:&lt;/STRONG&gt;&lt;BR /&gt;Watch for WSCG breaking. Some operators (Window, certain aggregates) and very wide schemas (beyond the spark.sql.codegen.maxFields limit, 100 columns by default) fall back to the interpreted path. explain() shows *(1) Project for WSCG-compiled stages and a plain Project for fallback. A lot of unstarred operators on a hot path is a regression worth investigating.&lt;BR /&gt;Control off-heap memory explicitly. Set spark.memory.offHeap.enabled=true and spark.memory.offHeap.size if your executors are under GC pressure. Tungsten's binary format avoids GC entirely for intermediate data, which matters enormously for large sort/agg operations.&lt;BR /&gt;Avoid Kryo serialization for columnar operations. Tungsten's binary row format and Kryo don't mix well. Keep Kryo for RDD-based legacy code; for DataFrames/Datasets, let Tungsten manage serialization.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Adaptive Query Execution (AQE) — Your Best Friend for ETL:&lt;/STRONG&gt;&lt;BR /&gt;AQE (enabled by default in Spark 3.2+) re-optimizes the physical plan at runtime using actual partition statistics after each shuffle stage. For high-volume ETL, this is often the single highest-ROI lever.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The three core features and how to tune them:&lt;/STRONG&gt;&lt;BR /&gt;spark.sql.adaptive.coalescePartitions.enabled (merges many small post-shuffle partitions into fewer, fuller ones)&lt;BR /&gt;spark.sql.adaptive.skewJoin.enabled (splits skewed partitions so one hot key cannot stall a sort-merge join)&lt;BR /&gt;Runtime join-strategy switching (sort-merge to broadcast) comes with spark.sql.adaptive.enabled itself; since Spark 3.2 its threshold is tunable separately via spark.sql.adaptive.autoBroadcastJoinThreshold.&lt;/P&gt;&lt;P&gt;One often-missed AQE gotcha: AQE only activates after a shuffle boundary. 
If your pipeline has no shuffles (e.g., it's a pure filter + project on a partitioned table), AQE does nothing. Structure your stages to give AQE observable shuffle stages to work with.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Other Architectural Levers Worth Exploring:&lt;/STRONG&gt;&lt;BR /&gt;Dynamic Partition Pruning (DPP): When joining a large fact table to a small dimension table, Spark 3 can push the dimension's filter as a subquery into the fact table scan, eliminating entire file partitions before they're read. It's on by default (spark.sql.optimizer.dynamicPartitionPruning.enabled) but only kicks in when the fact table is partitioned on the join key and the dimension side is broadcast-eligible.&lt;BR /&gt;Bucketing for repeat joins: If you join the same two large tables repeatedly (e.g., in a daily ETL), bucketing both on the join key eliminates the shuffle on every subsequent run. The upfront write cost pays off fast.&lt;BR /&gt;Z-ordering / liquid clustering (Delta Lake): If you're on Delta, Z-ordering co-locates related data physically, which compounds with DPP and file skipping. Liquid Clustering (Delta 3.1+) makes this largely maintenance-free.&lt;BR /&gt;Scan coalescing via file layout: Thousands of small files kill scan performance before any optimizer can help. A periodic compaction job (or OPTIMIZE on Delta) is often worth more than any query-level tuning.&lt;/P&gt;&lt;P&gt;The general diagnostic workflow I'd suggest: run explain("formatted") on your slowest stages, look for broken WSCG, check AQE's runtime plan diff (Spark UI → SQL tab → "AQE plan"), then verify statistics freshness. Most ETL regressions trace back to one of those three things.&lt;/P&gt;</description>
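A minimal sketch of the AQE settings discussed above, expressed as plain key/value pairs so it runs without a cluster. The keys are standard Spark 3.2+ configuration names; the 64m threshold is an illustrative starting point, not a recommendation:

```python
# AQE knobs from the reply, as key/value pairs you could feed to
# SparkSession.builder.config(...) or render as spark-submit --conf flags.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    # Coalesce many small post-shuffle partitions into fewer, fuller ones.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Split skewed partitions so one hot key cannot stall a sort-merge join.
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Spark 3.2+: broadcast threshold used when AQE re-plans a join with
    # actual shuffle sizes. Illustrative value, not a recommendation.
    "spark.sql.adaptive.autoBroadcastJoinThreshold": "64m",
}

def to_submit_flags(conf: dict) -> list:
    """Render the settings as spark-submit --conf arguments."""
    return [f"--conf {key}={value}" for key, value in sorted(conf.items())]

for flag in to_submit_flags(aqe_conf):
    print(flag)
```

Setting these at session-build time (rather than mid-job) keeps the runtime re-planning behavior consistent across all stages of the pipeline.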
      <pubDate>Thu, 02 Apr 2026 15:10:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-optimization-for-query-perfomance-and-pipeline-run/m-p/153062#M53925</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2026-04-02T15:10:01Z</dc:date>
    </item>
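On the small-files point from the reply, a back-of-envelope helper shows the scale of the win before any query-level tuning. The 1 GiB target matches Delta's default OPTIMIZE file size; the data volumes below are made up for illustration:

```python
# Rough model of the small-files problem: scan cost is dominated by per-file
# overhead (file open, footer read, task scheduling) when files are tiny.
def compaction_estimate(total_bytes: int, current_files: int,
                        target_file_bytes: int = 1024 ** 3) -> dict:
    """Compare file counts before/after compacting to a target file size."""
    target_files = max(1, -(-total_bytes // target_file_bytes))  # ceil division
    return {
        "current_files": current_files,
        "target_files": target_files,
        "files_eliminated": current_files - target_files,
    }

# 500 GiB of data fragmented into 200,000 files (~2.6 MiB each):
est = compaction_estimate(500 * 1024 ** 3, 200_000)
print(est)  # compacting to ~1 GiB files leaves 500 scan tasks, not 200,000
```

Each eliminated file is one less task to schedule and one less footer to read, which is why periodic OPTIMIZE (or an equivalent compaction job) often beats query-level tuning on fragmented tables.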
  </channel>
</rss>

