<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Community Articles topics</title>
    <link>https://community.databricks.com/t5/community-articles/bd-p/Knowledge-Sharing-Hub</link>
    <description>Community Articles topics</description>
    <pubDate>Mon, 29 Jun 2026 00:44:05 GMT</pubDate>
    <dc:creator>Knowledge-Sharing-Hub</dc:creator>
    <dc:date>2026-06-29T00:44:05Z</dc:date>
    <item>
      <title>LTAP: What Databricks' New Transactional-Analytical Architecture Means for Data Engineers</title>
      <link>https://community.databricks.com/t5/community-articles/ltap-what-databricks-new-transactional-analytical-architecture/m-p/160742#M1324</link>
      <description>&lt;P&gt;For years, enterprise data architecture has followed a familiar pattern.&lt;/P&gt;&lt;P&gt;An application writes customer orders, account updates, inventory changes, or transactions into an operational database.&lt;/P&gt;&lt;P&gt;Then data engineering takes over.&lt;/P&gt;&lt;P&gt;We capture changes through CDC. We land them in a lake or warehouse. We transform them through multiple layers. We create curated tables for analytics. Then, in many cases, we move enriched data back into an application through APIs, reverse ETL, or another synchronization process.&lt;/P&gt;&lt;P&gt;A simplified version looks like this:&lt;/P&gt;&lt;PRE&gt;Operational Application Database
        ↓
CDC / Replication
        ↓
Landing Layer
        ↓
Transformation Pipelines
        ↓
Lakehouse / Warehouse
        ↓
Dashboards, ML Models, AI Agents
        ↓
Reverse ETL / API / Application Sync&lt;/PRE&gt;&lt;P&gt;This architecture is common for a reason. It works.&lt;/P&gt;&lt;P&gt;But it also creates several familiar problems:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Multiple copies of the same business entity&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Delays between application activity and analytical availability&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;CDC failures and schema drift&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Reconciliation effort between operational and analytical views&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Complex reverse ETL or API layers&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Different governance models across different systems&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;AI applications operating on stale or incomplete context&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Databricks’ new LTAP architecture is interesting because it challenges the assumption that transactional and analytical data must always live in separate worlds.&lt;/P&gt;&lt;H2&gt;What Is LTAP?&lt;/H2&gt;&lt;P&gt;LTAP stands for &lt;STRONG&gt;Lake Transactional/Analytical Processing&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;The idea is not simply to run OLTP and OLAP workloads inside one engine.&lt;/P&gt;&lt;P&gt;Instead, LTAP aims to bring transactional, analytical, streaming, and AI application workloads closer to a shared governed data foundation.&lt;/P&gt;&lt;P&gt;Databricks positions Lakebase as the transactional layer in this model: a managed Postgres-compatible database integrated with the broader Databricks platform. 
The architectural goal is to reduce the need for separate copies, replicated pipelines, and synchronization layers between applications and analytics.&lt;/P&gt;&lt;P&gt;In simple terms, LTAP asks:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;What if an operational application, an analytics team, and an AI agent could work from a much closer version of the same governed data foundation?&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;That is a meaningful question for data engineers.&lt;/P&gt;&lt;H2&gt;The Traditional Gap Between OLTP and OLAP&lt;/H2&gt;&lt;P&gt;Let us take a simple customer-order scenario.&lt;/P&gt;&lt;P&gt;A customer places an order through an e-commerce application.&lt;/P&gt;&lt;P&gt;The application writes the order into an operational database.&lt;/P&gt;&lt;P&gt;The data engineering team then captures the change, transforms it, enriches it with customer and product data, and publishes it to analytics tables.&lt;/P&gt;&lt;P&gt;Later, an AI assistant may use that data to answer questions such as:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Why did this customer’s order fail?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is this customer eligible for a retention offer?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What products are frequently purchased together?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Is there a fraud or fulfillment risk?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Should the application trigger a proactive action?&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In a traditional architecture, each of those steps may involve separate systems and delayed synchronization.&lt;/P&gt;&lt;PRE&gt;Application Database
        ↓
CDC Pipeline
        ↓
Bronze / Raw Layer
        ↓
Silver / Cleansed Layer
        ↓
Gold / Analytics Layer
        ↓
Feature Store / API / Reverse ETL
        ↓
Application or AI Agent&lt;/PRE&gt;&lt;P&gt;The problem is not that any individual layer is bad.&lt;/P&gt;&lt;P&gt;The problem is that every handoff creates additional operational responsibility.&lt;/P&gt;&lt;P&gt;Someone must monitor the pipeline.&lt;/P&gt;&lt;P&gt;Someone must handle a failed CDC batch.&lt;/P&gt;&lt;P&gt;Someone must reconcile the dashboard number with the application number.&lt;/P&gt;&lt;P&gt;Someone must decide what happens when the source schema changes.&lt;/P&gt;&lt;P&gt;Someone must explain why the AI assistant used yesterday’s data while the application showed a newer transaction.&lt;/P&gt;&lt;P&gt;LTAP does not make these concerns disappear completely. But it creates a new architectural option for reducing unnecessary distance between the application, the data platform, and the intelligence layer.&lt;/P&gt;&lt;H2&gt;What Changes With LTAP?&lt;/H2&gt;&lt;P&gt;The most important shift is not “Databricks now supports transactions.”&lt;/P&gt;&lt;P&gt;The more important shift is:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;The transactional and analytical worlds can be designed around a more unified storage and governance foundation.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;That could reduce several common integration patterns:&lt;/P&gt;&lt;PRE&gt;Before:
Operational DB → CDC → Lakehouse → Reverse ETL → Application

Potential LTAP Pattern:
Application + Operational Data + Analytics + AI Context
        ↓
Shared Governed Data Foundation&lt;/PRE&gt;&lt;P&gt;For the right use cases, this can reduce:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Data replication&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Pipeline latency&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Reconciliation complexity&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Reverse ETL maintenance&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Fragmented security models&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Duplicate lineage documentation&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Delayed context for AI-powered applications&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, this does not mean every existing operational database should be moved immediately.&lt;/P&gt;&lt;P&gt;LTAP is a design option, not a universal replacement strategy.&lt;/P&gt;&lt;H2&gt;A Practical Example: Customer Support and Fraud Review&lt;/H2&gt;&lt;P&gt;Consider a customer-support or fraud-investigation workflow.&lt;/P&gt;&lt;P&gt;A support agent needs to see the latest customer profile, recent transactions, risk indicators, product history, and open service cases.&lt;/P&gt;&lt;P&gt;A fraud analyst needs historical behavior, anomaly scores, device patterns, and transaction trends.&lt;/P&gt;&lt;P&gt;An AI assistant needs governed context before it recommends an action.&lt;/P&gt;&lt;P&gt;In a traditional architecture, these could be spread across:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;An application database&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A CRM system&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A data warehouse&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A feature store&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A vector database&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A reverse ETL tool&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Several APIs&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;That means the agent or application may operate on a mixture of current and delayed information.&lt;/P&gt;&lt;P&gt;An LTAP-style architecture could allow teams to design this more 
directly:&lt;/P&gt;&lt;PRE&gt;Customer Transaction
        ↓
Transactional Operational State
        ↓
Shared Governed Data Foundation
        ↓
Analytics / Risk Models / AI Agent Context
        ↓
Human or Application Action&lt;/PRE&gt;&lt;P&gt;The value is not simply speed.&lt;/P&gt;&lt;P&gt;The value is that operational action, analytical understanding, and AI recommendation can be designed around more consistent data context.&lt;/P&gt;&lt;H2&gt;Where Data Engineers Still Matter&lt;/H2&gt;&lt;P&gt;There is a temptation with new platform architectures to assume that fewer pipelines means less data engineering.&lt;/P&gt;&lt;P&gt;I see it differently.&lt;/P&gt;&lt;P&gt;LTAP may reduce unnecessary plumbing, but it makes data engineering decisions even more important.&lt;/P&gt;&lt;P&gt;Teams will still need to design:&lt;/P&gt;&lt;H3&gt;1. Workload Boundaries&lt;/H3&gt;&lt;P&gt;Not every workload needs real-time access.&lt;/P&gt;&lt;P&gt;Some data should remain asynchronous because of cost, scale, reliability, or business-process requirements.&lt;/P&gt;&lt;P&gt;A daily finance reconciliation process does not necessarily need the same architecture as a real-time fraud decision.&lt;/P&gt;&lt;H3&gt;2. Data Contracts&lt;/H3&gt;&lt;P&gt;If operational and analytical workloads are closer together, schema discipline becomes more important.&lt;/P&gt;&lt;P&gt;A small application-side schema change can have downstream impact on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Analytics&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Machine learning features&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;AI agent context&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Data quality rules&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Regulatory reports&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Customer-facing workflows&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Data contracts, schema evolution rules, and impact analysis remain essential.&lt;/P&gt;&lt;H3&gt;3. 
Governance and Access Controls&lt;/H3&gt;&lt;P&gt;A single governed foundation is valuable only when access controls are designed properly.&lt;/P&gt;&lt;P&gt;Teams still need to define:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Which users can read transactional data&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Which users can update it&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Which data can be exposed to AI agents&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How sensitive fields are masked&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How access is audited&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How long data is retained&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How recovery and rollback work&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This is where unified governance can become more valuable than simply reducing pipeline count.&lt;/P&gt;&lt;H3&gt;4. Data Quality and Reconciliation&lt;/H3&gt;&lt;P&gt;LTAP may reduce copies, but it does not remove data-quality issues.&lt;/P&gt;&lt;P&gt;Bad source data is still bad data.&lt;/P&gt;&lt;P&gt;Missing customer identifiers, incorrect product mappings, unexpected nulls, duplicate transactions, and invalid business rules still need validation.&lt;/P&gt;&lt;P&gt;The difference is that data quality checks can potentially be designed closer to the point where operational and analytical decisions meet.&lt;/P&gt;&lt;H3&gt;5. Human Approval for AI Actions&lt;/H3&gt;&lt;P&gt;As AI agents move from answering questions to recommending or triggering actions, governance becomes critical.&lt;/P&gt;&lt;P&gt;An agent that sees fresh customer data is useful.&lt;/P&gt;&lt;P&gt;An agent that can trigger a customer action, change a workflow, or make a financial recommendation without review is a governance risk.&lt;/P&gt;&lt;P&gt;The future architecture needs more than real-time data.&lt;/P&gt;&lt;P&gt;It needs:&lt;/P&gt;&lt;PRE&gt;Trusted Data
→ Validated Context
→ AI Recommendation
→ Human Review or Policy Check
→ Approved Action&lt;/PRE&gt;&lt;P&gt;That is where data engineering, governance, and AI engineering come together.&lt;/P&gt;&lt;H2&gt;LTAP Does Not Eliminate ETL&lt;/H2&gt;&lt;P&gt;It is important to be realistic.&lt;/P&gt;&lt;P&gt;There will still be ETL, ELT, streaming transformations, data modeling, quality checks, and integration work.&lt;/P&gt;&lt;P&gt;Organizations will continue to have:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;SaaS applications&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Mainframes&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Third-party platforms&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Vendor APIs&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Legacy operational systems&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Regulatory reporting requirements&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Historical archives&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Domain-specific data products&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;LTAP will not magically eliminate those realities.&lt;/P&gt;&lt;P&gt;But it may reduce a category of pipelines that exist only because operational and analytical environments are disconnected by default.&lt;/P&gt;&lt;P&gt;That is a meaningful architectural shift.&lt;/P&gt;&lt;H2&gt;Questions I Would Ask Before Adopting LTAP&lt;/H2&gt;&lt;P&gt;Before adopting LTAP for an enterprise use case, I would ask:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Which current pipelines exist only to synchronize operational and analytical copies?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Which workflows truly need low-latency operational plus analytical context?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Which workloads must remain isolated for performance, reliability, or compliance?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What data contracts are required before operational and analytical consumers share the same foundation?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How will schema changes be governed?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How will AI agents access transactional context 
safely?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What approval and audit controls are needed for agent-driven actions?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How will teams measure whether LTAP reduces cost, latency, incidents, or reconciliation effort?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;These questions keep the discussion practical.&lt;/P&gt;&lt;H2&gt;Final Thought&lt;/H2&gt;&lt;P&gt;The most interesting part of LTAP is not that it promises fewer pipelines.&lt;/P&gt;&lt;P&gt;It is that it gives enterprises a new way to think about the relationship between:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Operational applications&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Transactional data&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Streaming data&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Analytics&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;AI agents&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Governance&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;For a long time, we accepted that those systems had to be connected through layers of copying, synchronization, and operational glue.&lt;/P&gt;&lt;P&gt;LTAP suggests that for some use cases, they can be designed around a closer and more governed foundation.&lt;/P&gt;&lt;P&gt;For data engineers, that does not reduce the importance of architecture.&lt;/P&gt;&lt;P&gt;It raises the importance of getting the architecture right.&lt;/P&gt;&lt;P&gt;The future will not be “one platform for everything.”&lt;/P&gt;&lt;P&gt;The future will be choosing the right boundary between real-time operational needs, analytical scale, governance, and human accountability.&lt;/P&gt;</description>
      <pubDate>Sat, 27 Jun 2026 13:06:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/ltap-what-databricks-new-transactional-analytical-architecture/m-p/160742#M1324</guid>
      <dc:creator>AmitDECopilot</dc:creator>
      <dc:date>2026-06-27T13:06:30Z</dc:date>
    </item>
    <item>
      <title>Reading Spark UI: A Repeatable Guide to Finding Performance Bottlenecks</title>
      <link>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160574#M1320</link>
      <description>&lt;DIV style="max-width: 860px; margin: 0 auto; padding: 0 32px 80px; background: #FFFFFF; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif; color: #1b3139; line-height: 1.75; font-size: 17px;"&gt;
&lt;P&gt;A &lt;A href="https://community.databricks.com/t5/data-engineering/spark-ui-troubleshooting-data-skew-vs-cluster-resource/td-p/160517" target="_self"&gt;question&lt;/A&gt; came up in the community recently that I thought deserved more than a short answer. The question was how to build a reliable investigation sequence for slow Spark jobs, specifically when symptoms overlap. A long-running stage with high spill and a few slow tasks could point to data skew, insufficient executor memory, too few partitions, or an inefficient join strategy. The Spark UI has all the information you need to tell them apart, but only if you know where to look and in what order.&lt;/P&gt;
&lt;P&gt;I put together a lab notebook with three intentionally broken jobs, one per bottleneck type, ran them on a Databricks cluster with Photon disabled to expose the classic Spark signatures, and captured every screenshot from a real run. This post walks through what each bottleneck looks like in the UI, the fix, and how to confirm the fix actually worked. The goal is a sequence you can apply directly the next time a stage takes longer than it should.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 20px 24px; margin: 28px 0; border-radius: 6px;"&gt;
&lt;DIV style="font-size: 12px; font-weight: bold; color: #ff3621; text-transform: uppercase; letter-spacing: 1.5px; margin-bottom: 12px;"&gt;Key Takeaways&lt;/DIV&gt;
&lt;UL style="margin: 0; padding-left: 20px; color: #1b3139;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;Start in the Stages tab. Find the slowest stage, then ask three questions in order before touching any configuration.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;Data skew, memory pressure, and underparallelism each produce a distinct Spark UI signature. The Max/Median duration ratio and the spill distribution are the fastest discriminators.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;A single faster run is not validation. The underlying metric (GC time, spill, task ratio) must move, not just wall-clock time.&lt;/LI&gt;
&lt;LI style="margin-bottom: 0;"&gt;Make sure AQE is enabled before doing any manual tuning (it is on by default since Spark 3.2). It resolves many shuffle partition and broadcast join problems automatically.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;DIV style="background: #FAFBFC; border: 1px solid #E8ECF0; border-radius: 8px; padding: 20px 28px; margin: 24px 0 16px;"&gt;
&lt;DIV style="font-size: 13px; font-weight: bold; color: #ff3621; text-transform: uppercase; letter-spacing: 1.5px; margin-bottom: 12px;"&gt;What's in this post&lt;/DIV&gt;
&lt;OL style="margin: 0; padding-left: 20px;"&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#sequence" target="_blank"&gt;The investigation sequence&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#skew" target="_blank"&gt;Scenario 1: Data skew&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#skew-fix" target="_blank"&gt;Scenario 1b: Broadcast join fix&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#memory" target="_blank"&gt;Scenario 2: Memory pressure&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#memory-fix" target="_blank"&gt;Scenario 2b: Partition fix&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#parallelism" target="_blank"&gt;Scenario 3: Underparallelism&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 6px; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#decision" target="_blank"&gt;Decision map and thresholds&lt;/A&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 0; font-size: 15px;"&gt;&lt;A style="color: #1b3139; text-decoration: none;" href="#validation" target="_blank"&gt;Validating the fix&lt;/A&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;/DIV&gt;
&lt;H2 id="sequence" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 32px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;The investigation sequence&lt;/H2&gt;
&lt;P&gt;Before anything else, open the &lt;STRONG&gt;Stages&lt;/STRONG&gt; tab and sort by Duration descending. Pick the longest stage. Everything else is noise until you understand that one stage.&lt;/P&gt;
&lt;P&gt;Inside the stage, you need three numbers before drawing any conclusions:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Median task duration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Max task duration&lt;/STRONG&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Median vs Max shuffle read size per task&lt;/STRONG&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The ratio of Max to Median is your first discriminator. From there, three questions applied in order will identify the primary bottleneck in the large majority of cases.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #1B3139; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Question 1:&lt;/STRONG&gt; Is Max Duration more than 5x Median? If yes, check whether shuffle read bytes are also skewed. Both conditions together indicate data skew.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Question 2:&lt;/STRONG&gt; Does spill appear on most tasks (not just outliers)? Is GC time above 10% in the Executors tab? If yes, it is memory pressure.&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Question 3:&lt;/STRONG&gt; Is task count well below 2x your executor core count? If yes, it is underparallelism.&lt;/DIV&gt;
&lt;P&gt;If none of those fit, open the &lt;STRONG&gt;SQL / DataFrame&lt;/STRONG&gt; tab and look at the physical plan. Missing predicate pushdown, unexpected cross joins, or a sort-merge join where broadcast would work are the next places to look.&lt;/P&gt;
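&lt;P&gt;The three questions can be sketched as a small function. This is a minimal illustration in plain Python, not part of the lab notebook: the function name, its parameters, and the return labels are hypothetical. It also assumes that "skewed" shuffle read means Max above 5x Median, that "most tasks" means more than half, and that "well below" can be approximated as any task count under 2x the core count. You would copy the inputs by hand from the Stages and Executors tabs.&lt;/P&gt;

```python
from statistics import median


def classify_stage(task_durations_ms, shuffle_read_bytes, spill_task_fraction,
                   gc_time_fraction, executor_cores):
    """Apply the three questions in order and return the first match.

    All thresholds below are the rule-of-thumb values from the text,
    plus the assumptions named in the lead-in.
    """
    # Question 1: Max duration more than 5x Median, and shuffle read also skewed.
    duration_skewed = max(task_durations_ms) > 5 * median(task_durations_ms)
    # Assumption: shuffle read counts as skewed when Max is above 5x Median.
    read_skewed = max(shuffle_read_bytes) > 5 * median(shuffle_read_bytes)
    if duration_skewed and read_skewed:
        return "data skew"

    # Question 2: spill on most tasks (assumed: more than half) and GC above 10%.
    if spill_task_fraction > 0.5 and gc_time_fraction > 0.10:
        return "memory pressure"

    # Question 3: task count below 2x the executor core count.
    if 2 * executor_cores > len(task_durations_ms):
        return "underparallelism"

    # None of the three fit: look at the physical plan next.
    return "inspect physical plan"


# 199 tasks at 38 ms and one at 900 ms, with all shuffle data on one task.
durations = [38] * 199 + [900]
reads = [0] * 199 + [10**9]
print(classify_stage(durations, reads, 0.0, 0.02, 8))
```

&lt;P&gt;The order matters: a skewed stage often spills on its hot task too, so checking the duration ratio first keeps that case from being misread as general memory pressure.&lt;/P&gt;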
&lt;HR /&gt;
&lt;H2 id="skew" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 1: Data skew&lt;/H2&gt;
&lt;P&gt;The lab job joins a 20 million-row fact table, in which 70% of rows share the same join key (key=1), against a 20-row reference table. Broadcast is explicitly disabled to force a sort-merge join, which means all 20 million rows must be shuffled and sorted by join key before the merge. The partition holding key=1 receives about 14 million rows. With only 20 distinct keys spread over 200 shuffle partitions, most other partitions receive nothing at all.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;&lt;SPAN&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN&gt;import&lt;/SPAN&gt; functions &lt;SPAN&gt;as&lt;/SPAN&gt; F

spark.conf.set(&lt;SPAN&gt;"spark.sql.adaptive.enabled"&lt;/SPAN&gt;, &lt;SPAN&gt;"false"&lt;/SPAN&gt;)
spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"200"&lt;/SPAN&gt;)
spark.conf.set(&lt;SPAN&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, &lt;SPAN&gt;"-1"&lt;/SPAN&gt;)  &lt;SPAN&gt;# force sort-merge join&lt;/SPAN&gt;

skew_data = (
    spark.range(&lt;SPAN&gt;20_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;,
        F.when(F.rand() &amp;lt; &lt;SPAN&gt;0.70&lt;/SPAN&gt;, F.lit(&lt;SPAN&gt;1&lt;/SPAN&gt;))
         .when(F.rand() &amp;lt; &lt;SPAN&gt;0.85&lt;/SPAN&gt;, F.lit(&lt;SPAN&gt;2&lt;/SPAN&gt;))
         .otherwise((F.rand() * &lt;SPAN&gt;18&lt;/SPAN&gt; + &lt;SPAN&gt;3&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;)))
    .withColumn(&lt;SPAN&gt;"value"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;1000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"payload"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 50)"&lt;/SPAN&gt;))
)

ref_data = (
    spark.range(&lt;SPAN&gt;1&lt;/SPAN&gt;, &lt;SPAN&gt;21&lt;/SPAN&gt;)
    .withColumnRenamed(&lt;SPAN&gt;"id"&lt;/SPAN&gt;, &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"label"&lt;/SPAN&gt;, F.concat(F.lit(&lt;SPAN&gt;"cat_"&lt;/SPAN&gt;), F.col(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"string"&lt;/SPAN&gt;)))
)

result = (
    skew_data.join(ref_data, &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"inner"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"label"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
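&lt;P&gt;One detail of the generator is easy to misread. Each F.rand() call without a seed is a separate random draw, so the second when() branch applies its 0.85 threshold only to the 30% of rows that missed the first branch. The expected shares are therefore about 70% for key 1, 0.30 x 0.85 = 25.5% for key 2, and the remaining 4.5% spread over keys 3 to 20. The sketch below checks that arithmetic in plain Python with the standard random module. It is a simulation of the same branching logic, not Spark code, and the function names are made up for illustration.&lt;/P&gt;

```python
import random
from collections import Counter


def draw_join_key(rng):
    # Mirrors the chained when(): each branch uses its own independent draw.
    if 0.70 > rng.random():      # same as rand() below 0.70
        return 1
    if 0.85 > rng.random():      # only reached by the remaining 30% of rows
        return 2
    return int(rng.random() * 18 + 3)  # truncates to keys 3..20


def key_shares(n_rows, seed=0):
    """Return the observed fraction of rows per join key."""
    rng = random.Random(seed)
    counts = Counter(draw_join_key(rng) for _ in range(n_rows))
    return {key: count / n_rows for key, count in sorted(counts.items())}


shares = key_shares(200_000)
print({key: round(share, 3) for key, share in shares.items()})
```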
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Task Metrics: the 24x ratio&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_0-1782413970227.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28285i68596E863E4E1DB4/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_0-1782413970227.png" alt="Ashwin_DSA_0-1782413970227.png" /&gt;&lt;/span&gt;
Stage 2 Task Metrics. Median duration 38ms, Max 0.9s: a 24x ratio. Shuffle Read is 0 bytes on 199 of 200 partitions; one partition holds the hot key.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The Task Metrics summary table is the most important screen in the Spark UI for diagnosing skew. Two rows matter here:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Duration&lt;/STRONG&gt;: Median 38ms, Max 0.9s. A 24x ratio. The 75th percentile sits at 53ms, meaning the outlier is not just slightly above average. It is categorically different from the rest of the distribution.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Shuffle Read Size&lt;/STRONG&gt;: Median 0 bytes / 0 records, with 199 of 200 tasks reading nothing. The Max task receives all the data, so most tasks have nothing to process.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the defining skew signature: a bimodal distribution where the vast majority of tasks finish in milliseconds and one task runs for orders of magnitude longer.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Event Timeline: the visual tell&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_1-1782413989929.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28286i0C9CAD53CF88B402/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_1-1782413989929.png" alt="Ashwin_DSA_1-1782413989929.png" /&gt;&lt;/span&gt;&lt;BR /&gt;Event Timeline for Stage 2. A handful of long green (Executor Computing Time) bars at the top, followed by 190+ short bars. This bimodal shape is the skew signature.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The Event Timeline converts the numeric ratio into something immediately visual. Long green bars indicate CPU time spent processing the hot key partition. The colour here matters: green is Executor Computing Time, confirming the slow tasks are CPU-bound on data processing, not waiting on I/O or network. Memory pressure and underparallelism do not produce this bimodal shape, which makes it the fastest visual discriminator between the three bottleneck types.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;DAG Visualization: confirming the join strategy&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_2-1782414031747.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28287iBAA1D228A920BF64/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_2-1782414031747.png" alt="Ashwin_DSA_2-1782414031747.png" /&gt;&lt;/span&gt;DAG for Stage 2. Two Exchange (shuffle) nodes feed into two Sort operations, which merge via SortMergeJoin inside WholeStageCodegen. Both sides were fully shuffled and sorted by join key, making skew on the join key directly visible as a task outlier.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The DAG confirms the mechanism. Two Exchange nodes (one per join side) followed by Sort operations and a SortMergeJoin. This is the join strategy that makes skew dangerous: every row for key=1 lands on the same partition, and one task must process all of it alone.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="skew-fix" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 1b: Broadcast join fix&lt;/H2&gt;
&lt;P&gt;The reference table has 20 rows. It fits comfortably in executor memory. Broadcasting it replicates the table to every executor, so the join no longer needs a shuffle on the join key. The SortMergeJoin and its two Exchange nodes disappear from the plan.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;&lt;SPAN&gt;# Re-enable broadcast: small table replicated to every executor, no shuffle on join key&lt;/SPAN&gt;
spark.conf.set(&lt;SPAN&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, str(&lt;SPAN&gt;10&lt;/SPAN&gt; * &lt;SPAN&gt;1024&lt;/SPAN&gt; * &lt;SPAN&gt;1024&lt;/SPAN&gt;))

result = (
    skew_data.join(F.broadcast(ref_data), &lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"inner"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"join_key"&lt;/SPAN&gt;, &lt;SPAN&gt;"label"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
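&lt;P&gt;To see why the broadcast version is immune to the hot key, it can help to model the join in plain Python. The sketch below is a conceptual model only, not Spark's implementation: the function name and the list-of-dicts row format are hypothetical. The small table becomes a lookup that every partition can read, and each fact partition is joined where it already sits, so no row is ever moved to a partition chosen by its join key.&lt;/P&gt;

```python
def broadcast_hash_join(fact_partitions, ref_rows):
    """Join each fact partition locally against a shared copy of the small table."""
    # Build the lookup once from the small side; this stands in for the broadcast copy.
    lookup = {row["join_key"]: row["label"] for row in ref_rows}

    joined = []
    for partition in fact_partitions:
        # Inner join: keep only rows whose key exists in the lookup.
        # Rows stay in their original partition, so key=1 is never gathered in one place.
        joined.append([
            {**row, "label": lookup[row["join_key"]]}
            for row in partition
            if row["join_key"] in lookup
        ])
    return joined


ref_rows = [{"join_key": k, "label": f"cat_{k}"} for k in range(1, 21)]
fact_partitions = [
    [{"join_key": 1, "value": 10.0}, {"join_key": 2, "value": 5.0}],
    [{"join_key": 1, "value": 7.5}, {"join_key": 99, "value": 1.0}],
]
joined = broadcast_hash_join(fact_partitions, ref_rows)
print([len(partition) for partition in joined])
```

&lt;P&gt;The hot key still accounts for most of the rows, but those rows stay spread across whatever partitions the scan produced, which is why the slow task disappears along with the shuffle.&lt;/P&gt;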
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Jobs comparison: the numbers&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_3-1782414084869.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28288i8DCA710A0D32A3BD/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_3-1782414084869.png" alt="Ashwin_DSA_3-1782414084869.png" /&gt;&lt;/span&gt;Jobs tab. Scenario 1 (no broadcast): 18s, 4 stages, 408 tasks. Scenario 1b (broadcast): 5s, 2 stages, 204 tasks. Half the stages, half the tasks, 3.6x faster.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The improvement is not incremental. The fix collapses two of the four stages entirely. In production, where a skewed sort-merge join might anchor a 30-minute stage, this is the difference between a job completing before business hours and one that misses its SLA.&lt;/P&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Stages after the fix&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_4-1782414108177.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28289i85F02D8D01FC5C8A/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_4-1782414108177.png" alt="Ashwin_DSA_4-1782414108177.png" /&gt;&lt;/span&gt;Scenario 1b Stages. Two stages: a 4-task scan and a 200-task aggregation shuffle. The 200-task shuffle join stage that produced the 24x task ratio is absent.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The fix did not make the slow task faster. It eliminated the stage that contained the slow task. That is a fundamentally different kind of improvement and it is the one to aim for with skew. If the data allows it, removing the shuffle is better than optimizing within it.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;When broadcast is not an option&lt;/STRONG&gt;: if the smaller side of the join is too large to broadcast (typically above 500MB to 1GB depending on executor memory), the next tool is salting. Add a random suffix to the hot key on the large side before the join so its rows spread across multiple partitions, replicate the matching rows on the other side once per suffix so every salted key still finds its match, then strip the suffix in a second pass. AQE's skew join optimization (&lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #E8ECF0; padding: 2px 5px; border-radius: 3px; font-size: 13.5px;"&gt;spark.sql.adaptive.skewJoin.enabled&lt;/CODE&gt;) handles many of these cases automatically by splitting a skewed partition into smaller tasks once it exceeds the configured thresholds.&lt;/DIV&gt;
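&lt;P&gt;The salting idea can be sketched without a cluster. The following is a minimal pure-Python simulation, not PySpark and not the lab's code: the row counts, the salt range &lt;CODE&gt;NUM_SALTS&lt;/CODE&gt;, and the helper names are all hypothetical. It salts the hot key on the large side, replicates the matching reference row once per salt, joins on the salted key, and then aggregates back to the original key.&lt;/P&gt;

```python
import random
from collections import Counter

NUM_SALTS = 8  # hypothetical salt range; tune to the degree of skew


def salt_large_side(rows, hot_keys, num_salts, rng):
    """Attach a random salt to rows of hot keys; other keys keep salt 0."""
    salted = []
    for key, value in rows:
        salt = rng.randrange(num_salts) if key in hot_keys else 0
        salted.append(((key, salt), value))
    return salted


def explode_small_side(ref_rows, hot_keys, num_salts):
    """Replicate each hot-key reference row once per salt so every salted key finds a match."""
    exploded = []
    for key, label in ref_rows:
        salts = range(num_salts) if key in hot_keys else range(1)
        for salt in salts:
            exploded.append(((key, salt), label))
    return exploded


rng = random.Random(42)
# Hypothetical data: 9,000 rows on hot key 1, and 100 rows on each of keys 2..11
rows = [(1, 1.0)] * 9000 + [(k, 1.0) for k in range(2, 12) for _ in range(100)]
ref_rows = [(k, "label_%d" % k) for k in range(1, 12)]

salted = salt_large_side(rows, {1}, NUM_SALTS, rng)
lookup = dict(explode_small_side(ref_rows, {1}, NUM_SALTS))

# Join on the salted key, then drop the salt and aggregate back to the original key
totals = Counter()
for (key, salt), value in salted:
    label = lookup[(key, salt)]
    totals[(key, label)] += value

rows_per_salted_key = Counter(salted_key for salted_key, _ in salted)
largest_group = max(rows_per_salted_key.values())
print("largest salted group:", largest_group, "rows; hot key total:", totals[(1, "label_1")])
```

&lt;P&gt;The hot key's 9,000 rows end up spread over eight salted keys instead of one, at the cost of replicating the hot reference row eight times. In Spark the same pattern would typically be expressed as an extra salt column on both sides of the join.&lt;/P&gt;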
&lt;HR /&gt;
&lt;H2 id="memory" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 2: Memory pressure&lt;/H2&gt;
&lt;P&gt;The lab job processes 4 million wide rows, each approximately 400 bytes of string data across five columns, shuffled into only 20 partitions. Each task receives roughly 8 MB of data. A &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; aggregation forces each task to hold the full list of strings in heap before writing the result. The combination of large per-task data and an in-memory accumulator produces GC pressure across all tasks.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"20"&lt;/SPAN&gt;)  &lt;SPAN&gt;# intentionally too few&lt;/SPAN&gt;

wide_data = (
    spark.range(&lt;SPAN&gt;4_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;, (F.col(&lt;SPAN&gt;"id"&lt;/SPAN&gt;) % &lt;SPAN&gt;20&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_a"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_b"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_c"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_d"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"col_e"&lt;/SPAN&gt;, F.expr(&lt;SPAN&gt;"repeat(cast(rand() as string), 80)"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;1000&lt;/SPAN&gt;)
)

result = (
    wide_data
    .repartition(&lt;SPAN&gt;20&lt;/SPAN&gt;, &lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .agg(
        F.sum(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;),
        F.collect_list(&lt;SPAN&gt;"col_a"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"all_a"&lt;/SPAN&gt;),  &lt;SPAN&gt;# forces large in-memory buffer&lt;/SPAN&gt;
        F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;)
    )
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Stages: large shuffle volume, few tasks&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_5-1782414148744.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28290iF88D60C7753820FB/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_5-1782414148744.png" alt="Ashwin_DSA_5-1782414148744.png" /&gt;&lt;/span&gt;Scenario 2 Stages. Stage 8: 20 tasks, 173.5 MiB shuffle read, 12 seconds. Compare to Scenario 1's Stage 2: 200 tasks, 7 KiB shuffle read, 9 seconds. The absolute shuffle volume flags the problem before you even click in.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The 173.5 MiB total shuffle read across 20 tasks means each task processes approximately 8.7 MiB of serialized data, before accounting for the deserialized in-memory representation, which is larger. This is where the memory constraint originates.&lt;/P&gt;
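&lt;P&gt;The per-task figure is a simple average, and it is worth reproducing when reading any stage. A small sketch using the numbers quoted for this stage (the helper name is hypothetical, and the even spread across tasks is an assumption that holds here only because the group keys are uniform):&lt;/P&gt;

```python
def per_task_average(total, num_tasks):
    """Average amount handled per task, assuming the work is spread evenly."""
    return total / num_tasks


# Figures from the Scenario 2 stage: 173.5 MiB shuffle read, 4 million rows, 20 tasks
avg_shuffle_mib = per_task_average(173.5, 20)
avg_records = per_task_average(4_000_000, 20)

print(f"{avg_shuffle_mib:.1f} MiB and {avg_records:,.0f} records per task")
```

&lt;P&gt;If the median shuffle read per task in the Task Metrics table is far from this average, the data is not evenly spread and skew becomes the more likely explanation.&lt;/P&gt;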
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Task Metrics: GC time is the signal&lt;/H3&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_6-1782414171745.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28291i70EBCBFA8FEE4F77/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_6-1782414171745.png" alt="Ashwin_DSA_6-1782414171745.png" /&gt;&lt;/span&gt;Stage 8 Task Metrics. Duration: Median 2s, Max 9s (4.5x ratio, much tighter than skew's 24x). GC Time: Max 4s on a 9s task, 44% of task time in garbage collection. Shuffle Read Median is 8.3 MiB / 200k records per task, confirming large uniform per-task data volumes.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;Two things separate this from skew:&lt;/P&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Duration ratio&lt;/STRONG&gt;: Max/Median is 4.5x, compared to 24x in Scenario 1. All the heavy tasks are slow together, not one outlier dragging the stage.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;GC Time&lt;/STRONG&gt;: Max 4 seconds on a 9-second task. That is 44% of task time in garbage collection. The JVM is constantly reclaiming the large string arrays built by &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt;. This is the clearest memory pressure indicator available in the Spark UI.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Note there is no Spill row in this screenshot. On this cluster with 10.7 GiB executor memory, the data stays in heap but causes severe GC pressure. In production with smaller executor memory relative to partition size, the same root cause produces disk spill instead. The fix is identical in both cases: reduce per-task data volume.&lt;/P&gt;
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #1B3139; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Executors tab note&lt;/STRONG&gt;: on a multi-executor cluster, check the Executors tab. The GC Time column shows cumulative GC per executor. Uniformly high GC across all executors points to a partition sizing problem. GC concentrated on specific executors points to uneven data distribution.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H2 id="memory-fix" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 2b: Partition fix&lt;/H2&gt;
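&lt;P&gt;The fix raises shuffle partitions from 20 to 400 and removes the &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; aggregation, so no large in-memory buffers are built. One caveat specific to this lab: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;group_key&lt;/CODE&gt; has only 20 distinct values, so hash-partitioning on it can fill at most 20 of the 400 partitions. The measured gain here therefore comes largely from dropping &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;collect_list&lt;/CODE&gt; and shuffling narrower rows. With a higher-cardinality key, the extra partitions would also shrink the data chunk each task receives.&lt;/P&gt;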
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"400"&lt;/SPAN&gt;)

result = (
    wide_data
    .repartition(&lt;SPAN&gt;400&lt;/SPAN&gt;, &lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"group_key"&lt;/SPAN&gt;)
    .agg(F.sum(&lt;SPAN&gt;"metric"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"total"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    &lt;SPAN&gt;# collect_list removed: no large in-memory accumulator&lt;/SPAN&gt;
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_11-1782414614629.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28296i7F3A49006216025E/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_11-1782414614629.png" alt="Ashwin_DSA_11-1782414614629.png" /&gt;&lt;/span&gt;Scenario 2b Stages. Stage 10: 400 tasks, 39.0 MiB shuffle read (down from 173.5 MiB), 6 seconds (down from 12 seconds).&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_12-1782414635602.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28297i7F369B63913346EC/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_12-1782414635602.png" alt="Ashwin_DSA_12-1782414635602.png" /&gt;&lt;/span&gt;Scenario 2b Task Metrics. GC Time: 0ms across all percentiles including Max. Duration: Median 10ms, Max 0.2s. The root cause is eliminated, not just reduced.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; margin: 20px 0; font-size: 15px;"&gt;
&lt;THEAD&gt;
&lt;TR style="border-bottom: 2px solid #1B3139;"&gt;
&lt;TH style="text-align: left; padding: 10px 12px; color: #1b3139;"&gt;Metric&lt;/TH&gt;
&lt;TH style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;20 partitions&lt;/TH&gt;
&lt;TH style="text-align: right; padding: 10px 12px; color: #1b3139;"&gt;400 partitions&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Stage duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;12s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;6s&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Median task duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;2s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;10ms&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max task duration&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;9s&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;0.2s&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max GC Time&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;4s (44% of task)&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;0ms&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR&gt;
&lt;TD style="padding: 10px 12px;"&gt;Total shuffle read&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px; color: #ff3621;"&gt;173.5 MiB&lt;/TD&gt;
&lt;TD style="text-align: right; padding: 10px 12px;"&gt;39.0 MiB&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
&lt;P&gt;GC time dropping to zero is the validation signal. Not the wall-clock improvement (which is real), but the fact that the JVM has nothing to collect. The objects that were triggering pressure no longer exist at that size in heap.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 id="parallelism" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Scenario 3: Underparallelism&lt;/H2&gt;
&lt;P&gt;The lab job runs a 3 million-row aggregation after repartitioning to 4 partitions. There is no skew and no spill. The cluster is simply not given enough tasks to use its available cores.&lt;/P&gt;
&lt;PRE style="background: #1B3139; border-left: 4px solid #FF3621; border-radius: 6px; padding: 20px 24px; margin: 20px 0; font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; font-size: 13.5px; line-height: 1.6; color: #e8ecf0; overflow-x: auto;"&gt;spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"4"&lt;/SPAN&gt;)  &lt;SPAN&gt;# intentionally too low&lt;/SPAN&gt;

under_data = (
    spark.range(&lt;SPAN&gt;3_000_000&lt;/SPAN&gt;)
    .withColumn(&lt;SPAN&gt;"category"&lt;/SPAN&gt;, (F.rand() * &lt;SPAN&gt;100&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"int"&lt;/SPAN&gt;).cast(&lt;SPAN&gt;"string"&lt;/SPAN&gt;))
    .withColumn(&lt;SPAN&gt;"value"&lt;/SPAN&gt;, F.rand() * &lt;SPAN&gt;500&lt;/SPAN&gt;)
)

result = (
    under_data
    .repartition(&lt;SPAN&gt;4&lt;/SPAN&gt;)  &lt;SPAN&gt;# 4 tasks regardless of cluster size&lt;/SPAN&gt;
    .groupBy(&lt;SPAN&gt;"category"&lt;/SPAN&gt;)
    .agg(F.avg(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"avg_val"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    .orderBy(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;, ascending=&lt;SPAN&gt;False&lt;/SPAN&gt;)
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()

&lt;SPAN&gt;# Fix: raise partitions to match data volume&lt;/SPAN&gt;
spark.conf.set(&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN&gt;"100"&lt;/SPAN&gt;)
result = (
    under_data.repartition(&lt;SPAN&gt;100&lt;/SPAN&gt;)
    .groupBy(&lt;SPAN&gt;"category"&lt;/SPAN&gt;)
    .agg(F.avg(&lt;SPAN&gt;"value"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"avg_val"&lt;/SPAN&gt;), F.count(&lt;SPAN&gt;"*"&lt;/SPAN&gt;).alias(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;))
    .orderBy(&lt;SPAN&gt;"cnt"&lt;/SPAN&gt;, ascending=&lt;SPAN&gt;False&lt;/SPAN&gt;)
)
result.write.format(&lt;SPAN&gt;"noop"&lt;/SPAN&gt;).mode(&lt;SPAN&gt;"overwrite"&lt;/SPAN&gt;).save()&lt;/PRE&gt;
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_13-1782414658971.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28298i0430183E617B73C0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_13-1782414658971.png" alt="Ashwin_DSA_13-1782414658971.png" /&gt;&lt;/span&gt;Scenario 3 Stages. Three stages, all with 4/4 tasks. On a production cluster with 32 or 64 cores, the same configuration leaves the vast majority of the cluster idle throughout the job.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;P&gt;The underparallelism tell in the Stages tab is the task count. If every stage runs a small, fixed number of tasks regardless of data volume, the shuffle partition count is almost certainly the constraint. Check &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; and compare it to 2x your executor core count as a starting floor.&lt;/P&gt;
&lt;P&gt;Unlike skew and memory pressure, underparallelism produces no spill, no GC pressure, and a tight Max/Median ratio. All tasks run cleanly. The job simply uses a fraction of available parallelism, so it takes proportionally longer than it needs to.&lt;/P&gt;
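&lt;P&gt;The cost of too few tasks can be estimated with simple wave arithmetic. The sketch below is an idealized model, not a measurement: it assumes one core per task, tasks of roughly equal duration, and a hypothetical 32-core cluster. The function name is also made up for illustration.&lt;/P&gt;

```python
import math


def wave_stats(task_count, total_cores):
    """Return (waves, idle cores in the last wave, overall core utilization).

    Assumes one core per task and tasks of roughly equal duration.
    """
    waves = math.ceil(task_count / total_cores)
    busy_in_last_wave = task_count - (waves - 1) * total_cores
    idle_in_last_wave = total_cores - busy_in_last_wave
    utilization = task_count / (waves * total_cores)
    return waves, idle_in_last_wave, utilization


TOTAL_CORES = 32  # hypothetical production cluster size

for tasks in (4, 2 * TOTAL_CORES, 100):
    waves, idle, utilization = wave_stats(tasks, TOTAL_CORES)
    print(f"{tasks:3d} tasks: {waves} wave(s), {idle} idle cores in last wave, {utilization:.0%} utilized")
```

&lt;P&gt;Under these assumptions, 4 tasks keep 28 of 32 cores idle for the whole stage. Note that 100 tasks also leave 28 cores idle in the final wave, which is why a multiple of the core count tends to fit slightly better when tasks are uniform.&lt;/P&gt;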
&lt;FIGURE style="margin: 16px 0;"&gt;
&lt;FIGCAPTION style="font-size: 13px; color: #6b8a97; margin-top: 8px; font-style: italic;"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Ashwin_DSA_14-1782414675999.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28299iF1DD90C43FC681A0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Ashwin_DSA_14-1782414675999.png" alt="Ashwin_DSA_14-1782414675999.png" /&gt;&lt;/span&gt;Scenario 3b Stages. 100 tasks in the aggregation stage, 94 in the sort stage. On a production cluster with 32 cores, the 4-task version wastes 28 cores per wave. The 100-task version keeps the cluster busy and completes in roughly 1/8 the wall-clock time.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;HR /&gt;
&lt;H2 id="decision" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Decision map and practical thresholds&lt;/H2&gt;
&lt;TABLE style="width: 100%; border-collapse: collapse; margin: 16px 0; font-size: 14px;"&gt;
&lt;THEAD&gt;
&lt;TR style="background: #1B3139; color: #ffffff;"&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;Symptom&lt;/TH&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;Root cause&lt;/TH&gt;
&lt;TH style="text-align: left; padding: 10px 12px;"&gt;First action&lt;/TH&gt;
&lt;/TR&gt;
&lt;/THEAD&gt;
&lt;TBODY&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Max/Median &amp;gt; 5x, shuffle read skewed&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Data skew&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Broadcast join if small side fits; salting if not&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0; background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Spill on most tasks or GC &amp;gt; 10%&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Memory pressure&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;More partitions before adding executor memory&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="border-bottom: 1px solid #E8ECF0;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;Task count &amp;lt; 2x executor cores&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Underparallelism&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Raise &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 5px; border-radius: 3px; font-size: 13px;"&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; or add &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 5px; border-radius: 3px; font-size: 13px;"&gt;repartition()&lt;/CODE&gt;&lt;/TD&gt;
&lt;/TR&gt;
&lt;TR style="background: #FAFBFC;"&gt;
&lt;TD style="padding: 10px 12px;"&gt;None of the above&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;Plan problem&lt;/TD&gt;
&lt;TD style="padding: 10px 12px;"&gt;SQL tab: check for cross joins, missing predicate pushdown, wrong join strategy&lt;/TD&gt;
&lt;/TR&gt;
&lt;/TBODY&gt;
&lt;/TABLE&gt;
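&lt;P&gt;The decision map can also be written down as a small function, which makes the order of the checks explicit. This is a sketch under stated assumptions: the inputs are values you read off the Spark UI by hand, the function and parameter names are hypothetical, and the skew check is simplified to the duration ratio alone (the table also asks you to confirm that shuffle read is skewed).&lt;/P&gt;

```python
def diagnose(max_over_median, gc_share, spilled, task_count, executor_cores):
    """Apply the decision map rows in order and return the first matching root cause.

    max_over_median: Max task duration divided by Median task duration
    gc_share: GC time as a fraction of task time (0.0 to 1.0)
    spilled: True if most tasks show disk spill
    """
    if max_over_median > 5:
        return "data skew"
    if spilled or gc_share > 0.10:
        return "memory pressure"
    if 2 * executor_cores > task_count:
        return "underparallelism"
    return "plan problem"


# Scenario 1: 24x ratio on a 200-task stage (GC share and core count are assumed values)
print(diagnose(24.0, 0.0, False, 200, 4))
# Scenario 2: Median 2s, Max 9s, Max GC 4s on the 9s task, 20 tasks (core count assumed)
print(diagnose(9 / 2, 4 / 9, False, 20, 4))
# Scenario 3 on a hypothetical 32-core cluster: 4 tasks, tight ratio, no GC pressure
print(diagnose(1.0, 0.0, False, 4, 32))
```

&lt;P&gt;Checking skew first matters, because a single oversized task can also show high GC time and would otherwise be misread as general memory pressure.&lt;/P&gt;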
&lt;H3 style="font-size: 19px; font-weight: bold; color: #1b3139; margin: 24px 0 12px;"&gt;Thresholds experienced teams use&lt;/H3&gt;
&lt;UL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Skew threshold&lt;/STRONG&gt;: any task reading more than 3x the median shuffle bytes warrants investigation. AQE's skew join optimization treats a partition as skewed only when it is larger than 256MB and larger than 5x the median partition size (both defaults). Lowering the size threshold to 64MB catches problems earlier: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=67108864&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Spill threshold&lt;/STRONG&gt;: any disk spill is worth addressing. Even 100MB of spill indicates the executor memory-to-partition-size ratio is wrong.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;GC threshold&lt;/STRONG&gt;: GC time above 10% of task time in the Executors tab means memory is a primary constraint. Above 20%, it is the dominant cause of slowness.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Broadcast threshold&lt;/STRONG&gt;: &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #F0F4F6; padding: 2px 6px; border-radius: 4px; font-size: 14.5px; color: #c03020;"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/CODE&gt; defaults to 10MB. In practice, tables of 500MB to 1GB can often be broadcast safely on modern clusters if driver and executor memory are adequate, since the driver collects the table before distributing it. Check the physical plan in the SQL tab to confirm which join strategy Spark chose.&lt;/LI&gt;
&lt;LI style="margin-bottom: 8px;"&gt;&lt;STRONG&gt;Partition sizing&lt;/STRONG&gt;: target 128 to 256MB of input data per partition for shuffle-heavy stages. For compute-intensive stages with complex UDFs, target 64MB to avoid GC pressure from large in-memory objects.&lt;/LI&gt;
&lt;/UL&gt;
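&lt;P&gt;The partition sizing guideline translates into a quick calculation. The sketch below combines the size target from the list above with the 2x-cores floor mentioned earlier. Combining the two into one rule is an assumption made for illustration, as are the function name and the example volumes.&lt;/P&gt;

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def suggested_shuffle_partitions(shuffle_bytes, target_partition_bytes, total_cores):
    """Partitions needed to keep each one near the target size, never below 2x cores."""
    by_size = -(-shuffle_bytes // target_partition_bytes)  # integer ceiling division
    by_cores = 2 * total_cores
    return max(by_size, by_cores)


# Hypothetical stages on a 32-core cluster
print(suggested_shuffle_partitions(100 * GIB, 128 * MIB, 32))  # shuffle-heavy target
print(suggested_shuffle_partitions(100 * GIB, 64 * MIB, 32))   # compute-intensive target
print(suggested_shuffle_partitions(1 * GIB, 128 * MIB, 32))    # small stage: core floor wins
```

&lt;P&gt;Treat the result as a starting value for &lt;CODE&gt;spark.sql.shuffle.partitions&lt;/CODE&gt; and confirm it against spill and GC time in the Spark UI.&lt;/P&gt;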
&lt;DIV style="background: #F0F4F6; border-left: 4px solid #FF3621; padding: 16px 20px; margin: 20px 0; border-radius: 4px; font-size: 15px;"&gt;&lt;STRONG&gt;Enable AQE first&lt;/STRONG&gt;: set &lt;CODE style="font-family: 'SF Mono', Monaco, Consolas, 'Courier New', monospace; background: #E8ECF0; padding: 2px 5px; border-radius: 3px; font-size: 13.5px;"&gt;spark.sql.adaptive.enabled=true&lt;/CODE&gt; before doing any manual tuning. AQE resolves the majority of shuffle partition count and broadcast join problems automatically. Use the investigation sequence above for what AQE does not fix: severe skew on low-cardinality keys, memory pressure from large individual task sizes, and first-stage underparallelism before AQE has seen statistics.&lt;/DIV&gt;
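&lt;P&gt;For reference, the AQE settings named in this article can be collected in one place. The sketch below is illustrative rather than a recommended baseline: the dictionary and helper names are hypothetical, and the 64MB threshold is the lowered value suggested above, not the default. AQE itself has been enabled by default since Spark 3.2, so on recent runtimes the first entry is usually a no-op.&lt;/P&gt;

```python
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    # Lowered from the 256MB default to 64MB, as suggested in the thresholds list
    "spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes": str(64 * 1024 * 1024),
}


def apply_settings(spark, settings):
    """Apply each setting through the session's runtime config (spark.conf.set)."""
    for key, value in settings.items():
        spark.conf.set(key, value)


# In a notebook where a SparkSession named `spark` already exists:
# apply_settings(spark, AQE_SETTINGS)
```

&lt;P&gt;Keeping the settings in a dictionary makes it easy to log exactly what was applied next to each run, which helps when comparing before and after metrics.&lt;/P&gt;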
&lt;HR /&gt;
&lt;H2 id="validation" style="font-size: 26px; font-weight: bold; color: #1b3139; margin: 28px 0 16px; padding-bottom: 8px; border-bottom: 3px solid #FF3621; display: inline-block;"&gt;Validating the fix&lt;/H2&gt;
&lt;P&gt;A single faster run is not enough. Here is what to check after applying a fix:&lt;/P&gt;
&lt;OL style="padding-left: 24px; margin: 0 0 16px;"&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;The metric moved, not just the clock.&lt;/STRONG&gt; If you salted for skew, confirm the Max/Median ratio dropped below 2x. If you increased partitions for memory pressure, confirm GC time dropped to zero or near zero, not just reduced. Wall-clock improvement with no change in the underlying metric means something else is limiting performance.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Spill is zero, not reduced.&lt;/STRONG&gt; Spill dropping from 10GB to 2GB means you improved the memory-to-partition ratio but did not fully resolve it. The correct partition size produces zero spill.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Run on a cold cache.&lt;/STRONG&gt; The second run of a job often benefits from cached shuffle files from the first run. Force a fresh run by clearing cache or changing the input path, otherwise the improvement may be partially artificial.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Run at full production data volume.&lt;/STRONG&gt; Skew and memory pressure are data-volume-dependent. A fix that works on a 10% sample frequently fails at full volume, because a sample can under-represent hot keys and per-partition data volume grows with the full dataset.&lt;/LI&gt;
&lt;LI style="margin-bottom: 12px;"&gt;&lt;STRONG&gt;Check across a time window.&lt;/STRONG&gt; For recurring jobs, pull 7 days of run history after applying the fix and confirm duration variance dropped. A job that averages fast but spikes 3x on weekends still has a skew or partition imbalance problem the average obscures.&lt;/LI&gt;
&lt;/OL&gt;
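&lt;P&gt;The time-window check in the last step is easy to automate once run durations are exported. A minimal sketch with made-up numbers: the seven durations and the helper name are hypothetical, and how you obtain the run history depends on your scheduler.&lt;/P&gt;

```python
from statistics import mean, median


def spike_ratio(durations_seconds):
    """Slowest run divided by the median run; close to 1.0 means stable durations."""
    return max(durations_seconds) / median(durations_seconds)


# Hypothetical 7-day history in seconds: five weekday runs and two slower weekend runs
week = [610, 595, 620, 605, 600, 1830, 1815]

print(f"mean {mean(week):.0f}s, median {median(week):.0f}s, spike ratio {spike_ratio(week):.1f}x")
```

&lt;P&gt;Comparing the slowest run to the median exposes the weekend spike that the mean alone would blur.&lt;/P&gt;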
&lt;P&gt;Hope this is useful for diagnosing your slow stages. If you have additional techniques or thresholds that work well in your environment, please share them in the comments. These patterns improve with more data points.&lt;/P&gt;
&lt;/DIV&gt;</description>
      <pubDate>Thu, 25 Jun 2026 19:11:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/reading-spark-ui-a-repeatable-guide-to-finding-performance/m-p/160574#M1320</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-06-25T19:11:51Z</dc:date>
    </item>
    <item>
      <title>DataFlint on Databricks - the Open Source Spark UI Upgrade Apache Spark Has Needed for Years</title>
      <link>https://community.databricks.com/t5/community-articles/dataflint-on-databricks-the-open-source-spark-ui-upgrade-apache/m-p/160365#M1309</link>
      <description>&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1782288499357.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28204i98FDE207BB004E16/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1782288499357.png" alt="szymon_dybczak_0-1782288499357.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;H3 id="7b31"&gt;Introduction&lt;/H3&gt;&lt;P class=""&gt;Apache Spark has become one of the most widely adopted engines for large-scale data processing. Its appeal is easy to understand: it supports batch processing, streaming workloads, feature engineering, machine learning pipelines, and large-scale analytical transformations across nearly every major data platform. It gives teams a powerful and flexible way to process data at massive scale.&lt;/P&gt;&lt;P class=""&gt;But that power comes with complexity. Because Spark is a distributed computing engine, its behavior is not always easy to reason about when something stops working as expected. A simple symptom, such as a slow pipeline or a failed job, can be caused by many different things: skewed data, inefficient joins, shuffle bottlenecks, memory pressure, spilling, poor partitioning, executor failures, configuration issues, or an unexpected change in the physical plan.&lt;/P&gt;&lt;P class=""&gt;The Spark UI contains a huge amount of useful information, but making sense of it is not straightforward. Relevant details are scattered across multiple tabs. Each tab gives you part of the picture, but rarely the full story.&lt;/P&gt;&lt;P class=""&gt;The Spark UI itself also feels dated in some areas. When tracking a long-running query, you often need to refresh the page manually to see progress. 
To understand what is happening, you have to jump between multiple tabs and mentally connect information that is presented separately.&lt;BR /&gt;As a result, Spark debugging can become cognitively demanding very quickly. The hard part is not just finding metrics, but understanding which metrics matter, how they relate to each other, and what conclusion can be drawn from them. For many teams, getting from “this job is slow” to “this is the actual bottleneck” still requires a lot of experience, patience, and manual investigation.&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_1-1782288499410.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28203i7DED6BE0508BA46C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_1-1782288499410.png" alt="szymon_dybczak_1-1782288499410.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;This is exactly the gap&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;DataFlint&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;aims to close.&lt;/P&gt;&lt;H3 id="23c3"&gt;What is DataFlint OSS?&lt;/H3&gt;&lt;P class=""&gt;DataFlint OSS is an open-source monitoring and debugging plugin for Apache Spark. It does not replace the Spark UI. Instead, it extends it by adding a dedicated DataFlint tab inside each Spark application.&lt;/P&gt;&lt;P class=""&gt;The goal is simple: make Spark performance easier to understand. DataFlint does this by going beyond raw execution data. It highlights patterns that often indicate performance problems, such as data skew, inefficient resource usage, small files, large partitions, problematic joins, or suspicious executor behavior. 
These findings appear as alerts, helping you move from “something is slow” to a likely explanation much faster.&lt;/P&gt;&lt;P class=""&gt;DataFlint also adds a more modern interface for exploring Spark applications, including run summaries, stage breakdowns, heat maps, syntax highlighting, and optional instrumentation for detailed operator-level timing. We will look at these features in the demo section, but the key idea is this: DataFlint makes it much easier to understand what happened in a Spark job and where to focus your attention.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_2-1782288499481.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28202i32A533A3C73A5046/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_2-1782288499481.png" alt="szymon_dybczak_2-1782288499481.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;H3 id="b60e"&gt;How it works&lt;/H3&gt;&lt;P class=""&gt;DataFlint is built on top of Apache Spark’s official plugin API. When a Spark application starts, Spark loads SparkDataflintPlugin through the normal plugin lifecycle. The plugin then creates a driver-side component, SparkDataflintDriverPlugin, and Spark calls its init() method during driver startup, before any jobs begin running.&lt;/P&gt;&lt;P class=""&gt;This is important: DataFlint uses the native extension mechanism that Spark provides for instrumentation and runtime integrations.&lt;/P&gt;&lt;P class=""&gt;The initialization flow has two main steps:&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;First, during init(), DataFlint can register a SQL extension called DataFlintInstrumentationExtension. This only happens when at least one instrumentation option is explicitly enabled. 
When enabled, the extension modifies Spark SQL physical plans by wrapping selected operators with timing nodes. These wrappers collect wall-clock duration metrics for parts of the query plan that the native Spark UI does not expose in the same level of detail.&lt;/LI&gt;&lt;LI&gt;Second, after init() completes, Spark calls registerMetrics(). At this point, DataFlint installs its Web UI tab, registers the REST endpoints used by the frontend, serves the bundled React single-page application, and attaches event listeners to Spark’s listener bus.&lt;/LI&gt;&lt;/OL&gt;&lt;P class=""&gt;The result is a modern single-page application embedded directly into the existing Spark Web UI. It can poll live application metrics, update without full page refreshes, and run without requiring a separate service outside the Spark driver process.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_3-1782288499412.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28205i41899C8966B5BA6D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_3-1782288499412.png" alt="szymon_dybczak_3-1782288499412.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;H2 id="c8ad"&gt;Installation&lt;/H2&gt;&lt;P class=""&gt;There are two main ways to install DataFlint on Databricks:&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;Install it directly from a notebook (super easy):&lt;BR /&gt;&lt;A class="" href="https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks#install-on-databricks-from-a-notebook" target="_blank" rel="noopener ugc nofollow"&gt;https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks#install-on-databricks-from-a-notebook&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Install it as a Spark plugin on a Databricks cluster, which is the recommended 
approach.&lt;/LI&gt;&lt;/OL&gt;&lt;P class=""&gt;Since the second option is recommended, let’s walk through that setup.&lt;/P&gt;&lt;P class=""&gt;Before creating the init script, we need a location where the script can be stored. Databricks allows us to use either a Unity Catalog volume or workspace files for this purpose.&lt;/P&gt;&lt;P class=""&gt;In this example, we will use a Unity Catalog volume for two reasons. First, Unity Catalog volumes are the recommended option when using DBR 13.3 LTS or later with Unity Catalog enabled. Second, workspace files are not supported on clusters running in standard access mode (formerly known as shared access mode).&lt;/P&gt;&lt;P class=""&gt;To create a volume, we can use the following command:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;%&lt;/SPAN&gt;&lt;SPAN class=""&gt;sql&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;CREATE&lt;/SPAN&gt; VOLUME databricks_demo_ws.default.demo_volume;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;Once the volume is ready, the next step is to prepare the init script. Open the DataFlint documentation and copy the init script that matches your Spark version. In my case, the cluster runs on Apache Spark 4.0, so I selected the corresponding DataFlint script.&lt;/P&gt;&lt;P class=""&gt;Save the script content as&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;init_script.sh and upload it to the Unity Catalog volume.&lt;/P&gt;&lt;P class=""&gt;One important thing to keep in mind is that standard access mode requires an administrator to add init scripts to the allowlist. 
Without this step, the cluster will fail to start and return an error similar to the following:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_4-1782288500110.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28207i4CBE9600E6C65617/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_4-1782288500110.png" alt="szymon_dybczak_4-1782288500110.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;P class=""&gt;To handle this, open Unity Catalog Explorer and follow these steps:&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;In your Databricks workspace, click&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Catalog&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Click the gear icon.&lt;/LI&gt;&lt;LI&gt;Click the metastore name to open the metastore details and permissions page.&lt;/LI&gt;&lt;LI&gt;Select&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Allowed JARs/Init Scripts&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Click&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Add init script&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and provide the correct path to your script.&lt;/LI&gt;&lt;/OL&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_5-1782288500193.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28206iF77162AFF64190A6/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_5-1782288500193.png" alt="szymon_dybczak_5-1782288500193.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Once the init script has been added to the allowlist, go to your compute configuration. 
Open the&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Advanced&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;section, then go to&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Init scripts&lt;/STRONG&gt;&lt;SPAN&gt;. Choose&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Volumes&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;as the source and provide the full path to the uploaded&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;init_script.sh&lt;/SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;file.&lt;/SPAN&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_6-1782288499500.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28208i59FBEF03BB25A0A7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_6-1782288499500.png" alt="szymon_dybczak_6-1782288499500.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Great, at this step we’re ready for installation. 
During my first attempt, the cluster failed to start because of a small typo in the init script that was available in the DataFlint documentation at the time.&lt;/SPAN&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_7-1782288499499.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28209i3D717E4B3796868C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_7-1782288499499.png" alt="szymon_dybczak_7-1782288499499.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;After a short debugging session, I corrected the script locally and flagged the issue to the DataFlint team. They responded very quickly, and the documentation issue has already been fixed.&lt;/SPAN&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;I’m leaving this note here for transparency, and also as a reminder that when working with init scripts, even a very small typo can prevent the cluster from starting.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_8-1782288502932.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28210iF2E3AD11BBD39759/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_8-1782288502932.png" alt="szymon_dybczak_8-1782288502932.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Below is the corrected version you can copy and paste:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;DATAFLINT_VERSION=&lt;SPAN class=""&gt;"0.9.9"&lt;/SPAN&gt;&lt;BR /&gt;SPARK_DEFAULTS_FILE=&lt;SPAN class=""&gt;"/databricks/driver/conf/00-custom-spark-driver-defaults.conf"&lt;/SPAN&gt;&lt;BR 
/&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;mkdir&lt;/SPAN&gt; -p /databricks/jars/&lt;BR /&gt;&lt;BR /&gt;wget --quiet \&lt;BR /&gt;  -O /databricks/jars/dataflint_spark4-databricks_2.13-&lt;SPAN class=""&gt;$DATAFLINT_VERSION&lt;/SPAN&gt;.jar \&lt;BR /&gt;  https://repo1.maven.org/maven2/io/dataflint/dataflint-spark4-databricks_2.13/&lt;SPAN class=""&gt;$DATAFLINT_VERSION&lt;/SPAN&gt;/dataflint-spark4-databricks_2.13-&lt;SPAN class=""&gt;$DATAFLINT_VERSION&lt;/SPAN&gt;.jar&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;if&lt;/SPAN&gt; [[ &lt;SPAN class=""&gt;$DB_IS_DRIVER&lt;/SPAN&gt; = &lt;SPAN class=""&gt;"TRUE"&lt;/SPAN&gt; ]]; &lt;SPAN class=""&gt;then&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;mkdir&lt;/SPAN&gt; -p /mnt/driver-daemon/jars/&lt;BR /&gt;  &lt;SPAN class=""&gt;cp&lt;/SPAN&gt; /databricks/jars/dataflint_spark4-databricks_2.13-&lt;SPAN class=""&gt;$DATAFLINT_VERSION&lt;/SPAN&gt;.jar /mnt/driver-daemon/jars/dataflint_spark4-databricks_2.13-&lt;SPAN class=""&gt;$DATAFLINT_VERSION&lt;/SPAN&gt;.jar&lt;BR /&gt;  &lt;SPAN class=""&gt;echo&lt;/SPAN&gt; &lt;SPAN class=""&gt;"[driver] {"&lt;/SPAN&gt; &amp;gt;&amp;gt; &lt;SPAN class=""&gt;$SPARK_DEFAULTS_FILE&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;echo&lt;/SPAN&gt; &lt;SPAN class=""&gt;"  spark.plugins = io.dataflint.spark.SparkDataflintPlugin"&lt;/SPAN&gt; &amp;gt;&amp;gt; &lt;SPAN class=""&gt;$SPARK_DEFAULTS_FILE&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class=""&gt;echo&lt;/SPAN&gt; &lt;SPAN class=""&gt;"}"&lt;/SPAN&gt; &amp;gt;&amp;gt; &lt;SPAN class=""&gt;$SPARK_DEFAULTS_FILE&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;fi&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;&lt;BR /&gt;If you author&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;init_script.sh&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;on a Windows machine, your editor will very likely save it with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Windows (CRLF) line endings&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(\r\n). 
The Databricks driver runs Linux, where the shell expects&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Unix (LF) line endings&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(\n).&lt;BR /&gt;A script saved with CRLF will fail in confusing ways - the trailing carriage return gets attached to the last token on each line, so you'll see errors such as:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;/bin/bash: line 2: $&lt;SPAN class=""&gt;'\r'&lt;/SPAN&gt;: &lt;SPAN class=""&gt;command&lt;/SPAN&gt; not found&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;&lt;STRONG&gt;How to avoid it:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;In&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;VS Code&lt;/STRONG&gt;, click the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;CRLF&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;indicator in the bottom-right status bar and switch it to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;LF, then save.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;This single issue comes up surprisingly often in Databricks Community threads, so I think it’s worth mentioning here.&lt;/P&gt;&lt;H2 id="a03b"&gt;DataFlint Demo: Seeing It in Action&lt;/H2&gt;&lt;P class=""&gt;Once the installation is complete, we can take a closer look at what DataFlint actually adds to the Spark debugging experience.&lt;/P&gt;&lt;P class=""&gt;To make the walkthrough more concrete, let’s run a sample analytical query against the TPC-DS dataset and then open the DataFlint tab in the Spark UI:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;    w.w_warehouse_name,&lt;BR /&gt;    it.i_category,&lt;BR /&gt;    it.i_class,&lt;BR /&gt;    &lt;SPAN class=""&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class=""&gt;DISTINCT&lt;/SPAN&gt; i.inv_item_sk) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; distinct_items,&lt;BR /&gt;    &lt;SPAN class=""&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class=""&gt;DISTINCT&lt;/SPAN&gt; i.inv_date_sk) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; 
inventory_dates,&lt;BR /&gt;    &lt;SPAN class=""&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class=""&gt;*&lt;/SPAN&gt;) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; inventory_rows,&lt;BR /&gt;    &lt;SPAN class=""&gt;SUM&lt;/SPAN&gt;(i.inv_quantity_on_hand) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; total_quantity_on_hand,&lt;BR /&gt;    &lt;SPAN class=""&gt;AVG&lt;/SPAN&gt;(i.inv_quantity_on_hand) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; avg_quantity_on_hand,&lt;BR /&gt;    &lt;SPAN class=""&gt;MIN&lt;/SPAN&gt;(i.inv_quantity_on_hand) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; min_quantity_on_hand,&lt;BR /&gt;    &lt;SPAN class=""&gt;MAX&lt;/SPAN&gt;(i.inv_quantity_on_hand) &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; max_quantity_on_hand&lt;BR /&gt;&lt;SPAN class=""&gt;FROM&lt;/SPAN&gt; samples.tpcds_sf1000.inventory i&lt;BR /&gt;&lt;SPAN class=""&gt;JOIN&lt;/SPAN&gt; samples.tpcds_sf1000.warehouse w&lt;BR /&gt;    &lt;SPAN class=""&gt;ON&lt;/SPAN&gt; i.inv_warehouse_sk &lt;SPAN class=""&gt;=&lt;/SPAN&gt; w.w_warehouse_sk&lt;BR /&gt;&lt;SPAN class=""&gt;JOIN&lt;/SPAN&gt; samples.tpcds_sf1000.item it&lt;BR /&gt;    &lt;SPAN class=""&gt;ON&lt;/SPAN&gt; i.inv_item_sk &lt;SPAN class=""&gt;=&lt;/SPAN&gt; it.i_item_sk&lt;BR /&gt;&lt;SPAN class=""&gt;GROUP&lt;/SPAN&gt; &lt;SPAN class=""&gt;BY&lt;/SPAN&gt;&lt;BR /&gt;    w.w_warehouse_name,&lt;BR /&gt;    it.i_category,&lt;BR /&gt;    it.i_class&lt;BR /&gt;&lt;SPAN class=""&gt;ORDER&lt;/SPAN&gt; &lt;SPAN class=""&gt;BY&lt;/SPAN&gt;&lt;BR /&gt;    total_quantity_on_hand &lt;SPAN class=""&gt;DESC&lt;/SPAN&gt;;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;For the demo, we will follow a typical investigation flow: start with the high-level symptoms, check whether the cluster was used efficiently, inspect the SQL plan, let alerts point us to likely issues, and only then enable deeper instrumentation if we need more precise timing.&lt;/P&gt;&lt;OL class=""&gt;&lt;LI&gt;Start with the Summary page to understand the workload at a high level.&lt;/LI&gt;&lt;LI&gt;Check 
cluster resources to see how executors, memory, and cores behaved during the run.&lt;/LI&gt;&lt;LI&gt;Inspect the SQL plan to identify expensive operators and heavy parts of the query.&lt;/LI&gt;&lt;LI&gt;Use Alerts to jump directly to suspicious patterns instead of manually hunting through metrics.&lt;/LI&gt;&lt;LI&gt;Enable instrumentation when you need more precise operator-level timing.&lt;/LI&gt;&lt;/OL&gt;&lt;H2 id="7b66"&gt;Step 1: Start with the Summary Page&lt;/H2&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_9-1782288499808.gif" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28213i0B3FA81C53F36AD8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_9-1782288499808.gif" alt="szymon_dybczak_9-1782288499808.gif" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;The Summary page gives a high-level view of how your query used cluster resources during execution. Instead of immediately jumping between Spark UI tabs such as Jobs, Stages, Executors, and SQL, DataFlint brings the most important performance indicators into one place.&lt;/P&gt;&lt;P class=""&gt;This makes the Summary page a useful first checkpoint. It helps you quickly understand whether a Spark job was efficient, over-provisioned, memory-constrained, shuffle-heavy, or affected by spill operations.&lt;/P&gt;&lt;P class=""&gt;At the top of the page, DataFlint shows metrics such as Duration, DCU, Input, Output, Memory Usage, Shuffle Read, Shuffle Write, Spill to Disk, Idle Cores, and Task Error Rate. Together, these metrics describe both the cost and behavior of the workload.&lt;/P&gt;&lt;P class=""&gt;One metric that deserves a short explanation is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;DCU&lt;/STRONG&gt;, which stands for DataFlint Compute Units. 
DCU is DataFlint’s measurement unit for Spark usage, similar in concept to a Databricks Unit, or DBU. It combines CPU and memory allocation into a single usage metric.&lt;/P&gt;&lt;P class=""&gt;The formula is:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;DCU&lt;/SPAN&gt; = (Core/Hour usage * &lt;SPAN class=""&gt;0.05&lt;/SPAN&gt;) + (GiB Memory/Hour usage * &lt;SPAN class=""&gt;0.005&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;&lt;STRONG&gt;Core/Hour:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;the number of cores allocated to your app, measured in core-hours.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;GiB Memory/Hour:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;the amount of memory allocated to your app, measured in GiB-hours.&lt;/P&gt;&lt;P class=""&gt;Another useful detail is that the Summary page refreshes automatically in real time. Unlike with the standard Apache Spark Web UI, you do not need to manually refresh the page to see updated metrics while the application is running. This makes live monitoring much more comfortable, especially when you want to watch resource usage, memory pressure, shuffle volume, spill, or task failures as they happen.&lt;/P&gt;&lt;H2 id="78b1"&gt;Step 2: Check Whether the Cluster Was Used Efficiently&lt;/H2&gt;&lt;P class=""&gt;Next, let’s look at the Resources page. 
This page gives a more focused view of how Spark resources are allocated and used during the job.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_10-1782288500537.gif" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28212iB30E193199C24866/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_10-1782288500537.gif" alt="szymon_dybczak_10-1782288500537.gif" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Executors Timeline&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;makes dynamic allocation easy to understand visually. In this run, Spark started with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;1 executor&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and later scaled to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;2 executors&lt;/STRONG&gt;. The configuration table below the chart shows the executor and driver resources, such as cores and memory, together with dynamic allocation settings like minimum and maximum executors.&lt;BR /&gt;DataFlint puts the resource timeline and configuration details in one place, which makes it easier to connect workload behavior with cluster behavior.&lt;/P&gt;&lt;H2 id="38b5"&gt;Step 3: Inspect the SQL Plan&lt;/H2&gt;&lt;P class=""&gt;After checking the overall workload and cluster behavior, we can move from symptoms to execution details. The Summary page contains a list of executed and currently running queries. 
When you click one of them, DataFlint opens an interactive SQL execution plan graph.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_11-1782288499483.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28211i6D7073478046B269/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_11-1782288499483.png" alt="szymon_dybczak_11-1782288499483.png" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;For example, clicking query ID 10 opens a graph view of the query plan. The plan can be viewed in three modes:&lt;/SPAN&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;STRONG&gt;I/O Only&lt;/STRONG&gt;: input/output scan and write nodes.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Basic&lt;/STRONG&gt;: the main transformations.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Advanced&lt;/STRONG&gt;: every node in the plan.&lt;/LI&gt;&lt;/UL&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_12-1782288503110.gif" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28216i3892240AD2478FF0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_12-1782288503110.gif" alt="szymon_dybczak_12-1782288503110.gif" /&gt;&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;Each node shows the operator name, such as Filter, Exchange, or FileScan, together with key metrics like output rows, shuffle bytes, spill size, partitions, and table name for scans.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_13-1782288499830.png" style="width: 
400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28214i5BE45F45E7AFBD2C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_13-1782288499830.png" alt="szymon_dybczak_13-1782288499830.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;A really useful feature is the performance heat bar at the top of each node. It uses green, orange, and red to show how much of the total query time was spent in that operator.&lt;/P&gt;&lt;P class=""&gt;There is also a MiniMap in the lower-left corner that directly complements the heat map. While the main graph lets you zoom in on individual nodes, complex queries can have plans that are too large to view in full at once. The MiniMap gives you a bird’s-eye view of the entire plan, with the same heat map coloring applied, so red nodes remain visible even when they are off-screen.&lt;/P&gt;&lt;P class=""&gt;Together, the heat map and MiniMap let you quickly locate the most expensive operator - and that’s super cool.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_14-1782288501613.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28215i8774B369C6EC3C63/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_14-1782288501613.png" alt="szymon_dybczak_14-1782288501613.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Some nodes can also display additional badges:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;A green flag badge indicates that the node is instrumented by TimedExec and has precise wall-clock timing.&lt;/LI&gt;&lt;LI&gt;A rocket badge indicates that the operator is running on a native accelerator such as Gluten, Comet, or Photon.&lt;/LI&gt;&lt;LI&gt;An alert badge indicates that DataFlint detected a potential 
issue on that node.&lt;/LI&gt;&lt;/UL&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_15-1782288501073.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28219iCB700D936D13E40E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_15-1782288501073.png" alt="szymon_dybczak_15-1782288501073.png" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Shuffle operations are especially nice in this view. DataFlint splits Exchange nodes into two separate half-nodes: shuffle write and shuffle read. Each side has its own metrics and stage association, which makes it much easier to see where shuffle cost is coming from.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_16-1782288500431.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28218i491DD606C1262743/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_16-1782288500431.png" alt="szymon_dybczak_16-1782288500431.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;The plan view also includes a stage grouping toggle. When enabled, nodes that run in the same Spark stage are visually enclosed in a stage container. Clicking a stage opens a side drawer with stage-level details.&lt;BR /&gt;This is another feature with no equivalent in Spark’s native UI. In Apache Spark, a query’s execution plan and its stages are two completely separate concepts - the SQL tab shows you the plan tree, and the Stages tab shows you a flat list of stages, with no visual connection between them. 
DataFlint bridges this gap by overlaying stage boundaries directly onto the plan graph.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_17-1782288501947.gif" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28217i4EF16E2F58D2856D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_17-1782288501947.gif" alt="szymon_dybczak_17-1782288501947.gif" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Finally, there is a duration mode toggle. You can switch between exclusive duration, which means time spent in that operator only, and inclusive duration, which means time spent in the operator and all of its children.&lt;/P&gt;&lt;P class=""&gt;The bottom toolbar contains useful navigation shortcuts:&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Speed: cycles through nodes ordered by duration percentage, highest first.&lt;/LI&gt;&lt;LI&gt;Warning: cycles through nodes with alerts.&lt;/LI&gt;&lt;LI&gt;Storage: cycles through nodes with spill, ordered by spill size.&lt;/LI&gt;&lt;LI&gt;Fit view / zoom controls: help you navigate large plans.&lt;/LI&gt;&lt;/UL&gt;&lt;H2 id="9abf"&gt;Step 4: Let Alerts Point to the Problem&lt;/H2&gt;&lt;P class=""&gt;So far the Summary and Resources pages have helped us&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;observe&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;what a job did. The Alerts page is where DataFlint goes a step further and starts&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;reasoning&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;about it. 
Instead of leaving you to interpret the metrics yourself, it continuously inspects every query, stage, and executor against a set of built-in heuristics and surfaces the ones that look wrong - each one written up as a short, plain-English finding with a concrete suggestion on how to fix it.&lt;/P&gt;&lt;P class=""&gt;This is probably the single biggest difference from the native Spark UI. The native UI will show you that one task in a stage ran for 28 seconds while the others finished in a second, but it will never tell you “this is data skew, and here’s what to do about it.”&lt;/P&gt;&lt;P class=""&gt;You need to know where to look, which numbers to compare, and what the comparison means. The Alerts page encodes that experience for you.&lt;/P&gt;&lt;P class=""&gt;When you open the tab, alerts are grouped by type and tagged as either a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;warning&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(yellow) or an&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;error&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(red), with a running count of each at the top. 
Every alert card has a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;“Go to alert”&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;button that jumps straight to the exact SQL node, stage, or resource the finding is about, so you never have to hunt for where the problem lives.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_18-1782288501946.gif" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28220iEB5312E00B119F75/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_18-1782288501946.gif" alt="szymon_dybczak_18-1782288501946.gif" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;H3 id="2910"&gt;A concrete example: data skew&lt;/H3&gt;&lt;P class=""&gt;The TPC-DS sample tables I’ve used so far are nicely balanced, so they will not show a strong skew problem on their own. To make the issue visible, we need to manufacture one. For the synthetic example, imagine a multi-tenant event platform where most tenants are small, but one enterprise tenant generates almost all of the traffic. This is something you can encounter in real systems: one customer, account, region, or tenant dominates the data distribution.&lt;/P&gt;&lt;P class=""&gt;The setup is simple. Most tenants produce only a small number of events and have one routing rule. But tenant_enterprise_001 produces around 97% of all events and has 100 routing rules. Then we run a reasonable-looking analytics query: join events to routing rules and summarize how many events each rule matched.&lt;/P&gt;&lt;P class=""&gt;Before running the query, I also move a few Spark optimizations out of the way. I disable broadcast joins so Spark cannot simply broadcast the rules table and avoid the shuffle. 
I also disable Adaptive Query Execution so Spark’s automatic skew handling does not rescue the query before DataFlint has anything interesting to show.&lt;BR /&gt;In production, AQE is usually something you want enabled. For a demo like this, though, turning it off lets the skew surface clearly.&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN class=""&gt;import&lt;/SPAN&gt; functions &lt;SPAN class=""&gt;as&lt;/SPAN&gt; F&lt;BR /&gt;&lt;BR /&gt;spark.conf.&lt;SPAN class=""&gt;set&lt;/SPAN&gt;(&lt;SPAN class=""&gt;"spark.sql.adaptive.enabled"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"false"&lt;/SPAN&gt;)&lt;BR /&gt;spark.conf.&lt;SPAN class=""&gt;set&lt;/SPAN&gt;(&lt;SPAN class=""&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"200"&lt;/SPAN&gt;)&lt;BR /&gt;spark.conf.&lt;SPAN class=""&gt;set&lt;/SPAN&gt;(&lt;SPAN class=""&gt;"spark.sql.autoBroadcastJoinThreshold"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"-1"&lt;/SPAN&gt;)&lt;BR /&gt;spark.conf.&lt;SPAN class=""&gt;set&lt;/SPAN&gt;(&lt;SPAN class=""&gt;"spark.sql.join.preferSortMergeJoin"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"true"&lt;/SPAN&gt;)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;N = &lt;SPAN class=""&gt;10_000_000&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;events_df = (&lt;BR /&gt;    spark.&lt;SPAN class=""&gt;range&lt;/SPAN&gt;(&lt;SPAN class=""&gt;0&lt;/SPAN&gt;, N, &lt;SPAN class=""&gt;1&lt;/SPAN&gt;, numPartitions=&lt;SPAN class=""&gt;128&lt;/SPAN&gt;)&lt;BR /&gt;    .select(&lt;BR /&gt;        F.when(F.rand(seed=&lt;SPAN class=""&gt;42&lt;/SPAN&gt;) &amp;lt; &lt;SPAN class=""&gt;0.97&lt;/SPAN&gt;, F.lit(&lt;SPAN class=""&gt;"tenant_enterprise_001"&lt;/SPAN&gt;))&lt;BR /&gt;         .otherwise(&lt;BR /&gt;             F.concat(&lt;BR /&gt;                 F.lit(&lt;SPAN class=""&gt;"tenant_"&lt;/SPAN&gt;),&lt;BR /&gt;                 F.lpad((F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;) % &lt;SPAN class=""&gt;20_000&lt;/SPAN&gt;).cast(&lt;SPAN 
class=""&gt;"string"&lt;/SPAN&gt;), &lt;SPAN class=""&gt;5&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"0"&lt;/SPAN&gt;)&lt;BR /&gt;             )&lt;BR /&gt;         ).alias(&lt;SPAN class=""&gt;"tenant_id"&lt;/SPAN&gt;),&lt;BR /&gt;&lt;BR /&gt;        F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;).alias(&lt;SPAN class=""&gt;"event_id"&lt;/SPAN&gt;),&lt;BR /&gt;        F.concat(F.lit(&lt;SPAN class=""&gt;"session_"&lt;/SPAN&gt;), (F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;) % &lt;SPAN class=""&gt;2_000_000&lt;/SPAN&gt;).cast(&lt;SPAN class=""&gt;"string"&lt;/SPAN&gt;)).alias(&lt;SPAN class=""&gt;"session_id"&lt;/SPAN&gt;),&lt;BR /&gt;        F.sha2(F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;).cast(&lt;SPAN class=""&gt;"string"&lt;/SPAN&gt;), &lt;SPAN class=""&gt;256&lt;/SPAN&gt;).alias(&lt;SPAN class=""&gt;"payload"&lt;/SPAN&gt;)&lt;BR /&gt;    )&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;events_df.createOrReplaceTempView(&lt;SPAN class=""&gt;"demo_events"&lt;/SPAN&gt;)&lt;BR /&gt;&lt;BR /&gt;normal_rules_df = (&lt;BR /&gt;    spark.&lt;SPAN class=""&gt;range&lt;/SPAN&gt;(&lt;SPAN class=""&gt;1&lt;/SPAN&gt;, &lt;SPAN class=""&gt;20_001&lt;/SPAN&gt;, &lt;SPAN class=""&gt;1&lt;/SPAN&gt;, numPartitions=&lt;SPAN class=""&gt;32&lt;/SPAN&gt;)&lt;BR /&gt;    .select(&lt;BR /&gt;        F.concat(&lt;BR /&gt;            F.lit(&lt;SPAN class=""&gt;"tenant_"&lt;/SPAN&gt;),&lt;BR /&gt;            F.lpad(F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;).cast(&lt;SPAN class=""&gt;"string"&lt;/SPAN&gt;), &lt;SPAN class=""&gt;5&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"0"&lt;/SPAN&gt;)&lt;BR /&gt;        ).alias(&lt;SPAN class=""&gt;"tenant_id"&lt;/SPAN&gt;),&lt;BR /&gt;        F.lit(&lt;SPAN class=""&gt;1&lt;/SPAN&gt;).alias(&lt;SPAN class=""&gt;"rule_id"&lt;/SPAN&gt;),&lt;BR /&gt;        F.lit(&lt;SPAN class=""&gt;"standard_rule"&lt;/SPAN&gt;).alias(&lt;SPAN class=""&gt;"rule_type"&lt;/SPAN&gt;)&lt;BR /&gt;    )&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;enterprise_rules_df = (&lt;BR /&gt;    spark.&lt;SPAN 
class=""&gt;range&lt;/SPAN&gt;(&lt;SPAN class=""&gt;1&lt;/SPAN&gt;, &lt;SPAN class=""&gt;101&lt;/SPAN&gt;, &lt;SPAN class=""&gt;1&lt;/SPAN&gt;, numPartitions=&lt;SPAN class=""&gt;4&lt;/SPAN&gt;)&lt;BR /&gt;    .select(&lt;BR /&gt;        F.lit(&lt;SPAN class=""&gt;"tenant_enterprise_001"&lt;/SPAN&gt;).alias(&lt;SPAN class=""&gt;"tenant_id"&lt;/SPAN&gt;),&lt;BR /&gt;        F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;).alias(&lt;SPAN class=""&gt;"rule_id"&lt;/SPAN&gt;),&lt;BR /&gt;        F.concat(F.lit(&lt;SPAN class=""&gt;"enterprise_rule_"&lt;/SPAN&gt;), F.col(&lt;SPAN class=""&gt;"id"&lt;/SPAN&gt;).cast(&lt;SPAN class=""&gt;"string"&lt;/SPAN&gt;)).alias(&lt;SPAN class=""&gt;"rule_type"&lt;/SPAN&gt;)&lt;BR /&gt;    )&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;rules_df = normal_rules_df.unionByName(enterprise_rules_df)&lt;BR /&gt;&lt;BR /&gt;rules_df.createOrReplaceTempView(&lt;SPAN class=""&gt;"demo_routing_rules"&lt;/SPAN&gt;)&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;skew_query = &lt;SPAN class=""&gt;"""&lt;BR /&gt;SELECT&lt;BR /&gt;    r.rule_type,&lt;BR /&gt;    COUNT(*) AS matched_events,&lt;BR /&gt;    COUNT(DISTINCT e.session_id) AS unique_sessions,&lt;BR /&gt;    MIN(e.event_id) AS first_event_id,&lt;BR /&gt;    MAX(e.event_id) AS last_event_id&lt;BR /&gt;FROM demo_events e&lt;BR /&gt;JOIN demo_routing_rules r&lt;BR /&gt;  ON e.tenant_id = r.tenant_id&lt;BR /&gt;GROUP BY r.rule_type&lt;BR /&gt;ORDER BY matched_events DESC&lt;BR /&gt;"""&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;display(spark.sql(skew_query))&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;After the query finishes, open the Alerts page in DataFlint. 
You should see a Partition Skew warning for the join stage.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_19-1782288501867.gif" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28221iF909AB67B06248D8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_19-1782288501867.gif" alt="szymon_dybczak_19-1782288501867.gif" /&gt;&lt;/span&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;If you click&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Go to Alert&lt;/STRONG&gt;, you will be redirected straight to the location in the physical plan where the problem occurs. This is a super nice feature.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_20-1782288502790.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28222iBE1F1E1CBFD63850/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_20-1782288502790.png" alt="szymon_dybczak_20-1782288502790.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;It is worth explaining why DataFlint treats this as skew rather than normal variation. The alert is based on the task-duration distribution within a stage. DataFlint compares the slowest task with the median task and raises a warning only when the difference is large enough to matter. That avoids noisy alerts for tiny stages where one task being a bit slower is irrelevant.&lt;/P&gt;&lt;P class=""&gt;In this example, the warning is expected because the workload is intentionally unbalanced. The median task processes a small tenant-sized partition, while the worst task processes the enterprise tenant and its 100-rule fan-out. 
That is exactly the scenario skew alerts are meant to make obvious.&lt;/P&gt;&lt;H3 id="5bc2"&gt;Another common offender: small files&lt;/H3&gt;&lt;P class=""&gt;Skew is about&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;time&lt;/EM&gt;; the next example is about&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;layout&lt;/EM&gt;. The problem here is a table written as thousands of tiny files instead of a few large ones - when you read it back, Spark spends more effort opening files and scheduling tasks than actually processing data. We can reproduce it intentionally by spreading a small amount of data across far too many output files:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;-- Force 5,000 output files for a tiny dataset -&amp;gt; a few hundred rows&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;-- (a few KB) per file.&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class=""&gt;TABLE&lt;/SPAN&gt; default.tiny_files &lt;SPAN class=""&gt;AS&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class=""&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class=""&gt;/*+ REPARTITION(5000) */&lt;/SPAN&gt; id, rand() &lt;SPAN class=""&gt;AS&lt;/SPAN&gt; v&lt;BR /&gt;&lt;SPAN class=""&gt;FROM&lt;/SPAN&gt; &lt;SPAN class=""&gt;range&lt;/SPAN&gt;(&lt;SPAN class=""&gt;0&lt;/SPAN&gt;, &lt;SPAN class=""&gt;1000000&lt;/SPAN&gt;); &lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;A million rows split across 5,000 files is roughly 200 rows - a few kilobytes per file. 
Simply reading the table back triggers the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Reading Small Files&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;warning:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class=""&gt;SUM&lt;/SPAN&gt;(v)&lt;BR /&gt;&lt;SPAN class=""&gt;FROM&lt;/SPAN&gt; default.tiny_files;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;The heuristic here is simple: DataFlint divides the bytes read by the number of files read for a scan, and if the average file is smaller than a few megabytes&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;and&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;the scan touched more than a hundred files, it flags it. There’s a matching alert on the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;write&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;side too - if your job is the one&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;producing&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;the small files, DataFlint will point that out and even tailor the advice to whether the output is partitioned.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_21-1782288502540.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28224iC43B4F8C7011CA05/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_21-1782288502540.png" alt="szymon_dybczak_21-1782288502540.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;H2 id="52f7"&gt;Step 5: Experimental feature: DataFlint Spark Instrumentation&lt;/H2&gt;&lt;P class=""&gt;DataFlint provides optional instrumentation that enhances Spark observability. It injects extra metrics and metadata into the Spark UI that Spark does not expose on its own. 
All instrumentation is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;opt-in and disabled by default&lt;/STRONG&gt;, so nothing about your query planning changes unless you explicitly turn it on.&lt;/P&gt;&lt;P class=""&gt;The mechanism is worth understanding before we use it. When any instrumentation flag is enabled, DataFlint registers a Spark SQL extension during driver startup. That extension hooks into Spark’s physical planning phase and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;wraps selected operators with a lightweight timing node&lt;/STRONG&gt;. The wrapper is transparent - it shows up as a single node in the plan graph (its name simply gets a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;DataFlint&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;prefix), it keeps all of the operator's original metrics, and it adds one new metric:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;duration&lt;/STRONG&gt;, the wall-clock time that operator actually spent doing work.&lt;/P&gt;&lt;P class=""&gt;Instrumentation is split into granular flags so you can enable just what you need. 
The two we’ll look at here are&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Window instrumentation&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;SQL nodes instrumentation&lt;/STRONG&gt;:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;# enable only window timing&lt;/SPAN&gt;&lt;BR /&gt;.config(&lt;SPAN class=""&gt;"spark.dataflint.instrument.spark.window.enabled"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"true"&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;# enable timing for the common SQL operators (filters, joins, scans, aggregates, ...)&lt;/SPAN&gt;&lt;BR /&gt;.config&lt;SPAN class=""&gt;(&lt;/SPAN&gt;&lt;SPAN class=""&gt;"spark.dataflint.instrument.spark.sqlNodes.enabled"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"true"&lt;/SPAN&gt;&lt;SPAN class=""&gt;)&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;# or turn everything on at once&lt;/SPAN&gt;&lt;BR /&gt;.config(&lt;SPAN class=""&gt;"spark.dataflint.instrument.spark.enabled"&lt;/SPAN&gt;, &lt;SPAN class=""&gt;"true"&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;On Databricks you’d add the same keys to your cluster’s Spark config.&lt;/P&gt;&lt;H2 id="1b11"&gt;Window instrumentation&lt;/H2&gt;&lt;P class=""&gt;Window instrumentation wraps Spark’s&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;WindowExec&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;so you can see how long the window computation took, right on the plan node.&lt;/P&gt;&lt;P class=""&gt;Let’s try it with the dummy query below:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;    c_customer_sk,&lt;BR /&gt;    c_last_name&lt;BR /&gt;&lt;SPAN class=""&gt;FROM&lt;/SPAN&gt; samples.tpcds_sf1.customer&lt;BR /&gt;QUALIFY &lt;SPAN class=""&gt;row_number&lt;/SPAN&gt;() &lt;SPAN class=""&gt;OVER&lt;/SPAN&gt; (&lt;SPAN class=""&gt;PARTITION&lt;/SPAN&gt; &lt;SPAN 
class=""&gt;BY&lt;/SPAN&gt; c_last_name &lt;SPAN class=""&gt;ORDER&lt;/SPAN&gt; &lt;SPAN class=""&gt;BY&lt;/SPAN&gt; c_customer_sk &lt;SPAN class=""&gt;DESC&lt;/SPAN&gt;) &lt;SPAN class=""&gt;=&lt;/SPAN&gt; &lt;SPAN class=""&gt;1&lt;/SPAN&gt;;&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;We enable&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;spark.dataflint.instrument.spark.window.enabled, run the query, open the SQL plan... and, weirdly,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;there is no window operator to be found.&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;What happened?&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_22-1782288502433.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28225iBEA9206AF52BD1FA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_22-1782288502433.png" alt="szymon_dybczak_22-1782288502433.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;After some investigation it turns out this query never uses a plain&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;WindowExec&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;operator at all. Because the window is immediately filtered by QUALIFY, Spark applies an optimization introduced in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;[SPARK-37099]&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;- a dedicated physical operator called&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;WindowGroupLimit&lt;/STRONG&gt;.&lt;/P&gt;&lt;P class=""&gt;The idea behind SPARK-37099 is as follows: for rank-style functions such as&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;row_number,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;rank, and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dense_rank, the rank of a key computed on a partial dataset is always less than or equal to its final rank over the full dataset. 
This means Spark can safely discard rows whose partial rank already exceeds&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;k, before the expensive shuffle and window processing take place. To do this, Spark inserts a per-window-group limit both before and after the shuffle.&lt;/P&gt;&lt;P class=""&gt;As a result, the window execution time we intended to measure is now captured inside&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;WindowGroupLimit, which is not covered by the window flag.&lt;/P&gt;&lt;P class=""&gt;Let’s remove the filtering from our query.&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;&lt;SPAN class=""&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;    c_customer_sk,&lt;BR /&gt;    c_last_name,&lt;BR /&gt;    &lt;SPAN class=""&gt;row_number&lt;/SPAN&gt;() &lt;SPAN class=""&gt;OVER&lt;/SPAN&gt; (&lt;SPAN class=""&gt;PARTITION&lt;/SPAN&gt; &lt;SPAN class=""&gt;BY&lt;/SPAN&gt; c_last_name &lt;SPAN class=""&gt;ORDER&lt;/SPAN&gt; &lt;SPAN class=""&gt;BY&lt;/SPAN&gt; c_customer_sk &lt;SPAN class=""&gt;DESC&lt;/SPAN&gt;)&lt;BR /&gt;&lt;SPAN class=""&gt;FROM&lt;/SPAN&gt; samples.tpcds_sf1.customer&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P class=""&gt;After that, we should be able to see the window instrumentation. Great - we’re learning DataFlint and Spark internals at the same time! 
And as you can see, window instrumentation worked this time.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_23-1782288498971.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28223iC44EFF58EE2960C3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_23-1782288498971.png" alt="szymon_dybczak_23-1782288498971.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;And good news: I asked the DataFlint team about&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;WindowGroupLimitExec&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;support, and they confirmed that it is already planned for the next release. So soon we should have instrumentation for this operator as well - nice.&lt;/P&gt;&lt;H2 id="7590"&gt;SQL nodes instrumentation&lt;/H2&gt;&lt;P class=""&gt;SQL nodes instrumentation casts a much wider net. Instead of a single operator family, it wraps the common physical operators that make up most query plans - filters, projections, joins, sorts, hash/sort aggregates, the file and batch scans, the write command.&lt;/P&gt;&lt;P class=""&gt;When enabled, DataFlint’s DataFlintInstrumentationExtension wraps each SQL physical operator with a TimedExec node that measures actual wall-clock execution time per operator. The result is a duration metric on every instrumented node - not an estimate derived from task metrics, but a direct measurement of how long that specific operator spent processing data. 
This is what powers the heat map: without instrumentation, duration percentages are approximated from stage-level data; with instrumentation, every node carries its own precise timing.&lt;/P&gt;&lt;P class=""&gt;On the screen below, you can see that after enabling SQL node instrumentation, wrapped operators are visible even in the native UI.&lt;/P&gt;&lt;DIV class=""&gt;&lt;BR /&gt;&lt;DIV class=""&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_24-1782288499056.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28226i2C02CF6FE261EB30/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_24-1782288499056.png" alt="szymon_dybczak_24-1782288499056.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;The practical impact is significant. Consider a stage containing a SortMergeJoin followed by a Filter followed by a Project. At the stage level, they all look the same — part of a 3-minute stage. With instrumentation, you might discover the join consumed 2 minutes 50 seconds, and the filter ran in under a second. That distinction is the difference between tuning the right thing and tuning the wrong thing.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1782289226252.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28227iE159505E1AE93E57/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1782289226252.png" alt="szymon_dybczak_0-1782289226252.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;P class=""&gt;Instrumentation is intentionally opt-in - it rewrites the physical query plan, which carries a small overhead and a compatibility risk with native accelerators like Gluten or Comet. 
But for workloads where you need to understand performance at operator granularity rather than stage granularity, it transforms the plan view from a structural diagram into a genuine performance profile.&lt;/P&gt;&lt;H2 id="354e"&gt;Conclusion&lt;/H2&gt;&lt;P class=""&gt;DataFlint is an excellent addition to the Spark ecosystem. It makes day-to-day debugging much easier by bringing the most important pieces of information into one place, highlighting suspicious patterns, and helping you move from “something is slow” to “this is probably why” much faster.&lt;/P&gt;&lt;P class=""&gt;I have already started using DataFlint in my daily workflow, and it has made Spark performance investigation feel much less painful. If you work with Spark regularly, I definitely recommend giving it a try.&lt;/P&gt;&lt;P class=""&gt;You can check out the project on GitHub, and if you find it useful, consider giving the repository a star. It’s a simple way to support the project and help more Spark users discover it.&lt;/P&gt;&lt;P class=""&gt;Also, if you are interested in Spark optimization more broadly, I highly recommend checking out the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://www.youtube.com/@Dataflint" target="_blank" rel="noopener ugc nofollow"&gt;DataFlint YouTube&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;channel and the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://bigdataperformance.substack.com/" target="_blank" rel="noopener ugc nofollow"&gt;Big Data Performance Substack&lt;/A&gt;, which is run by one of DataFlint’s founders. Both are great resources for learning more about Spark performance, debugging, and optimization.&lt;/P&gt;&lt;P class=""&gt;And finally, if you want to help improve the Spark debugging experience for everyone, consider contributing to the project.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jun 2026 08:21:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/dataflint-on-databricks-the-open-source-spark-ui-upgrade-apache/m-p/160365#M1309</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-06-24T08:21:16Z</dc:date>
    </item>
    <item>
      <title>A small song for the Databricks Community</title>
      <link>https://community.databricks.com/t5/community-articles/a-small-song-for-the-databricks-community/m-p/160287#M1303</link>
      <description>&lt;P&gt;&lt;div class="video-embed-center video-embed"&gt;&lt;iframe class="embedly-embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FbJofjBonxos%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DbJofjBonxos&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FbJofjBonxos%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" width="640" height="360" scrolling="no" title="Databricks community Song" frameborder="0" allow="autoplay; fullscreen; encrypted-media; picture-in-picture" allowfullscreen="true"&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;One Question Can Light the Spark&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;My Databricks journey started in 2022 with simple interest, curiosity, and a dream to learn more.&lt;/P&gt;&lt;P&gt;At that time, I was just trying to understand the platform, follow the updates, learn from others, and slowly build my confidence. Like many learners, I had questions, doubts, and moments where I did not know where to start.&lt;/P&gt;&lt;P&gt;Over time, the Databricks Community became more than a place to read posts. It became a place to learn, ask, share, and grow.&lt;/P&gt;&lt;P&gt;Every question taught me something.&lt;BR /&gt;Every answer gave me direction.&lt;BR /&gt;Every discussion opened a new idea.&lt;BR /&gt;Every contributor inspired me to give back.&lt;/P&gt;&lt;P&gt;From that first interest in 2022 to attending the Data + AI Summit in 2026, this journey has been very special to me. 
Seeing the Databricks Community in action, meeting people, learning from leaders, and feeling the energy of this ecosystem made me even more grateful.&lt;/P&gt;&lt;P&gt;That is why I created this song for the Databricks Community.&lt;/P&gt;&lt;P&gt;I created it with love, respect, and gratitude for every visitor, learner, contributor, champion, and leader who makes this community meaningful.&lt;/P&gt;&lt;P&gt;Many people come here with a simple question. Some come with errors. Some come with ideas. Some come with big dreams. And this community gives them a place to start.&lt;/P&gt;&lt;P&gt;One helpful answer can give someone confidence.&lt;BR /&gt;One shared experience can save someone hours.&lt;BR /&gt;One kind reply can make someone feel welcome.&lt;BR /&gt;One contribution can inspire many more.&lt;/P&gt;&lt;P&gt;That is the real beauty of the Databricks Community.&lt;/P&gt;&lt;P&gt;Visitors become learners.&lt;BR /&gt;Learners become contributors.&lt;BR /&gt;Contributors become champions.&lt;BR /&gt;Champions inspire the next generation.&lt;BR /&gt;And community leaders continue to build a space where everyone can feel included.&lt;/P&gt;&lt;P&gt;I truly love Databricks and this community. I would love to keep helping in the best way I can, by learning, sharing, supporting others, and inspiring more people to participate.&lt;/P&gt;&lt;P&gt;This song is my small tribute to all of us.&lt;/P&gt;&lt;P&gt;One question can light the spark.&lt;BR /&gt;One answer can guide the heart.&lt;BR /&gt;Together, we learn.&lt;BR /&gt;Together, we share.&lt;BR /&gt;Together, we grow.&lt;/P&gt;&lt;P&gt;Thank you, Databricks Community.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jun 2026 17:55:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/a-small-song-for-the-databricks-community/m-p/160287#M1303</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2026-06-23T17:55:42Z</dc:date>
    </item>
    <item>
      <title>Databricks CustomerLake</title>
      <link>https://community.databricks.com/t5/community-articles/databricks-customerlake/m-p/160101#M1296</link>
      <description>&lt;H1&gt;Databricks CustomerLake: Inside the Agentic CDP Built for the Age of AI&lt;/H1&gt;&lt;P&gt;&lt;EM&gt;A deep dive into what CustomerLake actually is, how it works, and what it looks like in practice.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="SK7788775_1-1782127631085.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28132i042F5E5E1F76514D/image-size/large?v=v2&amp;amp;px=999" role="button" title="SK7788775_1-1782127631085.png" alt="SK7788775_1-1782127631085.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;HR /&gt;&lt;P&gt;At Data + AI Summit 2026, Databricks announced CustomerLake — an Agentic Customer Data Platform built natively inside the Databricks Lakehouse.&lt;/P&gt;&lt;P&gt;Not a standalone tool. Not a separate layer bolted on top. The thinking behind it is simple: rather than pulling customer data out into yet another platform, bring the CDP capabilities directly into the environment where that data already lives — with governance, security, and data infrastructure already in place.&lt;/P&gt;&lt;P&gt;This post walks through what CustomerLake covers, how it works, and what the product actually looks like — drawing from the official keynote, product demo, and launch materials.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;The Problem CustomerLake Is Solving&lt;/H2&gt;&lt;P&gt;Marketing at most enterprises still follows a familiar sequence. A plan gets defined. Data teams pull together what is needed. Audiences are assembled. A campaign gets configured in some automation tool, pushed out, and then measured. Rinse and repeat.&lt;/P&gt;&lt;P&gt;The cycle has worked well enough — but it moves slowly. Building and refining campaigns typically runs across weeks, sometimes months. And the output, despite all the effort, tends to be broad. 
The same message going out to large groups, not genuinely tailored to individual customers.&lt;/P&gt;&lt;P&gt;Meanwhile, the buying side is evolving fast. AI agents are now doing research, comparing options, and making purchases on behalf of consumers — always available, reacting to new context almost immediately, operating across a growing number of channels. Marketing built around weekly batch cycles does not keep pace with that.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;The Concept: Infinity Campaigns&lt;/H2&gt;&lt;P&gt;CustomerLake is built around a core idea Databricks calls &lt;STRONG&gt;Infinity Campaigns&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Traditional campaigns are time-boxed — they start, run to a predefined audience, and end. Infinity Campaigns work differently. They are continuous engagement loops, always running, with no fixed end state. Every customer gets evaluated individually in real time, which means the one-to-many model gives way to something closer to true one-to-one engagement.&lt;/P&gt;&lt;P&gt;The underlying logic: customer actions and signals get picked up by enterprise-side agents, processed against the customer's full profile and context, and a decision gets made — does this person need to hear from us right now, and if so, what and through which channel? When an action is taken, that interaction becomes a new signal, feeding back into the same loop.&lt;/P&gt;&lt;P&gt;Evergreen. Always adapting. No campaign relaunch required.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;What Makes a CDP "Agentic"&lt;/H2&gt;&lt;P&gt;CDPs have historically served three core functions: building a unified customer profile, enabling marketers to define audience segments, and pushing those segments out to execution tools like email or mobile platforms.&lt;/P&gt;&lt;P&gt;The limitation has always been architectural. CDPs lived outside the core data platform. 
They needed their own copy of the customer data, maintained their own governance layer, and required ongoing data movement to stay current.&lt;/P&gt;&lt;P&gt;For an agentic approach to work, that architecture breaks down. Agents need access to everything — customer history, behavioral context, business rules, predictive models, campaign performance — all in one place, without data movement introducing lag or gaps.&lt;/P&gt;&lt;P&gt;Databricks built CustomerLake around three requirements for what an agentic CDP needs to be:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Embedded in the lakehouse&lt;/STRONG&gt; — customer data, context, and agents share the same infrastructure. No copies, no sync jobs, no reconciliation between systems.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Built around agents as the core operating model&lt;/STRONG&gt; — not a conventional platform with an AI feature added. The agent is how data gets prepared, how audiences get shaped, how campaigns get planned, and how decisions get made per customer.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Capable of true one-to-one personalization at scale&lt;/STRONG&gt; — not segments of thousands, but individual decisions made continuously for every customer in the system.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;The Architecture&lt;/H2&gt;&lt;P&gt;CustomerLake has two main components: &lt;STRONG&gt;Profile Agents&lt;/STRONG&gt; and &lt;STRONG&gt;Campaign Agents&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;Raw customer data flows in, gets processed by Profile Agents into clean unified profiles, and those profiles become the foundation for Campaign Agents to run Infinity Campaigns. A built-in Reverse ETL layer handles pushing decisions and audience data back out to the execution tools that reach customers — email platforms, ad networks, SMS, in-app messaging, and more.&lt;/P&gt;&lt;P&gt;Data sources include anything already sitting in the Databricks Lakehouse, plus external data from MarTech and CRM systems brought in through Lakeflow Connect. 
Unity Catalog handles governance across the whole stack — the same controls that apply to the rest of the data estate apply here too.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Profile Agents: Building the Customer 360&lt;/H2&gt;&lt;P&gt;Getting to a reliable, unified customer profile is foundational to everything else. Profile Agents handle the full pipeline to get there.&lt;/P&gt;&lt;H3&gt;Data Preparation&lt;/H3&gt;&lt;P&gt;Lakeflow Connect brings in external data — from CRM platforms, MarTech tools, and third-party sources — alongside whatever is already in the lakehouse. Once a new dataset lands, Genie (Databricks' AI layer) takes over the preparation work.&lt;/P&gt;&lt;P&gt;It reads the dataset, identifies what each column actually represents — email, phone number, full name, address — and applies semantic tags accordingly. It then generates normalization rules to clean and standardize the data automatically, handling inconsistencies and filtering out invalid values without any manual mapping.&lt;/P&gt;&lt;P&gt;Third-party data enrichment is accessible through a &lt;STRONG&gt;Data &amp;amp; Identity Marketplace&lt;/STRONG&gt; — providers can be connected and their data pulled in with a single click.&lt;/P&gt;&lt;H3&gt;Identity Resolution&lt;/H3&gt;&lt;P&gt;Matching records across different data sources — recognizing that two entries with slightly different details actually represent the same person — has always been one of the harder problems in customer data work.&lt;/P&gt;&lt;P&gt;CustomerLake handles this through what Databricks calls &lt;STRONG&gt;Agentic Identity Resolution&lt;/STRONG&gt;, which runs across three stages:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Rules-based matching&lt;/STRONG&gt; covers the straightforward cases — exact matches on unique IDs, normalized email addresses, or combinations like phone number with a fuzzy name match. 
The rules are readable and configurable.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;LLM review&lt;/STRONG&gt; handles the middle ground — cases where the rules do not reach a confident conclusion. A language model steps in to assess whether two profiles are likely the same person.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Human review&lt;/STRONG&gt; is reserved for the genuinely uncertain — a queue where a person makes the final determination.&lt;/P&gt;&lt;P&gt;What ties this together is a feedback loop. Every decision made at the LLM and human stages gets incorporated back into the rules layer, so each run of the identity resolution process is more accurate than the last. Organizations can also bring their own ML models into the pipeline if they already have them.&lt;/P&gt;&lt;P&gt;When a new data source is added, Genie automatically analyzes it against existing matching rules and recommends additional rules where gaps or opportunities are identified — explaining the reasoning behind each suggestion and previewing the expected impact before anything is applied.&lt;/P&gt;&lt;H3&gt;Gold Customer Table&lt;/H3&gt;&lt;P&gt;The end product of Profile Agents is a &lt;STRONG&gt;Gold Customer Table&lt;/STRONG&gt; — a single governed schema that every data source maps into. Where sources disagree on a field value, survivorship rules decide which one wins. The whole thing is configurable through a UI or YAML, so both technical and non-technical team members can work with it.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Campaign Agents: From Goal to Individual Decision&lt;/H2&gt;&lt;P&gt;With a clean, unified customer profile in place, Campaign Agents take over — translating business goals into personalized, continuously running campaigns.&lt;/P&gt;&lt;H3&gt;Building Audiences&lt;/H3&gt;&lt;P&gt;Audience creation works through natural language. A marketer describes the audience they need, and Genie builds the segment directly against live lakehouse data. No SQL. 
No hand-off to a data analyst.&lt;/P&gt;&lt;P&gt;Existing audiences can be refined further the same way, by simply describing the additional conditions needed. Genie converts the description into precise data filters and updates the segment instantly.&lt;/P&gt;&lt;P&gt;Audience insights — size trend over time, purchase category breakdown, average spend, churn risk — are surfaced automatically. Suppression rules reference live data conditions rather than point-in-time exports, so someone who converts mid-campaign is removed from eligibility immediately, not at the next scheduled refresh.&lt;/P&gt;&lt;H3&gt;Campaign Planning&lt;/H3&gt;&lt;P&gt;Turning an audience and a goal into a campaign starts with a brief conversation. Genie asks a focused set of questions — which channels to use, how many messages to send, when the campaign should conclude — and uses the answers to generate a structured campaign brief.&lt;/P&gt;&lt;P&gt;The brief covers the full picture: goals and success metrics, a sequenced messaging plan with rationale per touchpoint, timing and cadence, guardrails (frequency limits, opt-out lists, suppression of customers with open support tickets), personalization signals to draw on, and the assumptions behind the plan.&lt;/P&gt;&lt;P&gt;This document becomes the foundation the campaign is built from. It is fully editable before anything gets built.&lt;/P&gt;&lt;H3&gt;Decisioning and Reasoning&lt;/H3&gt;&lt;P&gt;Before going live, Campaign Agents can run a &lt;STRONG&gt;pre-launch simulation&lt;/STRONG&gt; across a sample of real qualified profiles. 
The simulation shows what the agent would actually do for each person — which message they would receive, whether they would be deferred based on existing campaign load — without triggering any actual sends.&lt;/P&gt;&lt;P&gt;Each profile in the simulation comes with a &lt;STRONG&gt;Reasoning panel&lt;/STRONG&gt;: a plain-language explanation of why that specific message was chosen, which rule it matched, and why the send timing was set the way it was. The agent also accounts for campaigns running in parallel — if a customer is already receiving heavy outreach from another active campaign, that factors into the decision before anything goes out.&lt;/P&gt;&lt;P&gt;This kind of per-profile transparency, available before launch rather than after a complaint, changes how marketers can review and trust the decisioning layer.&lt;/P&gt;&lt;H3&gt;Performance and Activation&lt;/H3&gt;&lt;P&gt;Once a campaign is live, Campaign Agents monitor it continuously — flagging performance trends and suggesting adjustments in real time. Native A/B testing makes variant comparison straightforward across the key engagement metrics.&lt;/P&gt;&lt;P&gt;Activation runs through Reverse ETL — bi-directional connections to the MarTech and AdTech tools already in use, covering email, SMS, in-app, and advertising platforms.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Early Customers and Partners&lt;/H2&gt;&lt;P&gt;CustomerLake has been in private rollout with select enterprise customers ahead of the public announcement. Early customers include GM, AB InBev, HP, Circle K, Barclays, and Getnet.&lt;/P&gt;&lt;P&gt;The platform launches with an open partner ecosystem spanning identity, activation, measurement, and customer experience — alongside implementation partners supporting deployment at enterprise scale.&lt;/P&gt;&lt;HR /&gt;&lt;H2&gt;Where Things Stand&lt;/H2&gt;&lt;P&gt;CustomerLake is currently available in &lt;STRONG&gt;Private Preview&lt;/STRONG&gt;. 
Organizations interested in early access should reach out to their Databricks account team.&lt;/P&gt;&lt;P&gt;The product makes the most sense for teams whose data foundation is already on Databricks — the value comes from not having to replicate that foundation elsewhere to support marketing use cases. If the data is already there, the Customer 360, the audiences, and the campaign intelligence can be built on top of it directly, under the same governance that covers the rest of the data estate.&lt;/P&gt;&lt;HR /&gt;&lt;P&gt;&lt;EM&gt;Sources:&lt;/EM&gt; &lt;EM&gt;&lt;A href="https://www.databricks.com/blog/introducing-customerlake-agentic-cdp" target="_blank"&gt;Introducing CustomerLake: The Agentic CDP embedded in Databricks — Databricks Blog&lt;/A&gt;&lt;/EM&gt; &lt;EM&gt;&lt;A href="https://www.youtube.com/watch?v=lpS01YiHqjU" target="_blank"&gt;Introducing Databricks CustomerLake — Official YouTube&lt;/A&gt;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 11:28:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/databricks-customerlake/m-p/160101#M1296</guid>
      <dc:creator>ShivamKumar7788</dc:creator>
      <dc:date>2026-06-22T11:28:42Z</dc:date>
    </item>
    <item>
      <title>Medallion Architecture Has 3 Layers. We Built 5. Here's Why — Views Layer Design on Databricks</title>
      <link>https://community.databricks.com/t5/community-articles/medallion-architecture-has-3-layers-we-built-5-here-s-why-views/m-p/160041#M1294</link>
      <description>&lt;P class=""&gt;&lt;FONT size="3"&gt;Part 4 of my enterprise data platform series is up - this one covers why we added a fifth layer to the standard medallion architecture.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;We connected BI tools to the Gold layer and immediately hit four problems Gold alone couldn't solve:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Schema breaks when we renamed a Gold column (three Tableau reports broke immediately)&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Three workbooks calculating vendor aging differently, none of them agreeing&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;The same vendor_master join running independently across four dashboards&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Row-level filtering that we didn't want duplicated in every downstream tool&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;All four solutions pointed to the same thing - a Views layer between Gold and consumers.&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;What's in the post:&lt;/FONT&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Schema stability via views - one update instead of fixing every downstream query&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Business logic abstraction - vendor aging buckets defined once, consumed everywhere&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Row-level security with dynamic views using current_user() and Unity Catalog&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;Pre-joined views for heavy consumers with Databricks Dashboard query cache&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;View naming convention, Git-based version control for view definitions&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;What I'd do differently - designing Views before Gold, not after&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;Full post on Medium: 
&lt;A class="" href="https://medium.com/@savlahanish/medallion-architecture-has-3-layers-we-built-5-heres-why-41408c71c6b7" target="_blank" rel="noopener"&gt;https://medium.com/@savlahanish/medallion-architecture-has-3-layers-we-built-5-heres-why-41408c71c6b7&lt;/A&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;Part 5 is where it gets messier - Tableau and Databricks Dashboards behaving differently against the same views, a decimal precision issue that cost two hours, and what happens when interactive queries and Tableau batch refreshes hit the same SQL Warehouse at 9am.&lt;/FONT&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;Happy to answer questions on any of the decisions - particularly around the row-level security pattern or the naming convention.&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 07:40:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/medallion-architecture-has-3-layers-we-built-5-here-s-why-views/m-p/160041#M1294</guid>
      <dc:creator>savlahanish27</dc:creator>
      <dc:date>2026-06-22T07:40:25Z</dc:date>
    </item>
    <item>
      <title>Metric Views with Power BI and Tabular Editor (Part 3 of 3)</title>
      <link>https://community.databricks.com/t5/community-articles/metric-views-with-power-bi-and-tabular-editor-part-3-of-3/m-p/160040#M1293</link>
      <description>&lt;P class=""&gt;&lt;STRONG&gt;&lt;EM&gt;This is part 3 of 3 in a series where I take you through working with Metric Views.&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;EM&gt;Part 1: Introduction to Metric Views&lt;/EM&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;EM&gt;Part 2: Metric Views and the Databricks platform (AI/BI Dashboards, Genie, etc.)&lt;/EM&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;EM&gt;Part 3: Metric Views with Power BI and Tabular Editor&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H3&gt;Not all semantic models are created equal&lt;/H3&gt;&lt;P class=""&gt;While a semantic model is a general concept that can be applied across a lot of different systems, I think it is no coincidence that Microsoft uses this exact term in Power BI to describe the models that underpin its reporting capabilities. As a result, people might associate the term Semantic Model directly with Power BI, but as described in Part 1, we consider Metric Views to be a Semantic Model in their own right as well.&lt;/P&gt;&lt;P class=""&gt;In order to understand a bit about how and why things work the way they do, I think it might be a good idea to highlight some core concepts of Semantic Models, and how they work a little differently between Databricks and Power BI.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Different models, same purpose&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;In my mind, a Semantic Model in Power BI is synonymous with a star-schema model. This is mainly due to the fact that the Power BI engine is designed around the star-schema, which means that it evaluates queries faster when the model is designed like this. 
When loaded into the semantic model, a star schema preserves each of the tables on their own.&lt;/P&gt;&lt;P class=""&gt;In contrast to the star schema, a Databricks Metric View resembles more of a One-Big-Table style of semantic model. So even though we have facts and dimensions stored in separate tables, the Metric View itself is defined as a single view. This does introduce some issues, such as what to do in the case of a multi-fact model, or how to solve different granularities such as time. That being said, I do expect the Metric View to evolve to possibly handle some of these cases in the future.&lt;/P&gt;&lt;P class=""&gt;For more details on the different types of Semantic Models, and their pros/cons I recommend this article by &lt;A class="" href="https://www.linkedin.com/article/edit/7396113425862406144/#" target="_blank" rel="noopener noreferrer nofollow"&gt;Kurt Buhler&lt;/A&gt;, written for Tabular Editor. This is centered around Power BI, but the general concepts are applicable across different BI tools.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;A href="https://tabulareditor.com/blog/data-model-types-examples-and-tips-for-power-bi-part-2" target="_self"&gt;Data model types, examples, and tips for Power BI&lt;/A&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;DIV class=""&gt;&lt;HR /&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Loading Metric Views in Power BI - Natively&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Databricks Metric Views provide a unified semantic layer that can be queried via SQL. We have dimensions, measures and even some nice metadata descriptions, type hints and similar resources defined. 
Therefore, you might have the same idea as me.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P class=""&gt;&lt;EM&gt;We should be able to use the Metric View as the source for a Power BI report directly&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P class=""&gt;With my completely naive and optimistic approach, I went ahead hoping that at least some of the model could be loaded and used as it was defined in Databricks. Well, unfortunately that was not at all the case. In fact, if you try to set up a Power BI model querying a Metric View using the native Databricks connector, pointing directly at your Metric View, you will not get anywhere. You will instead be met with the error message below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="KrisJohannesen_1-1782113562227.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28097iDAE7DAB4E5C076E0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="KrisJohannesen_1-1782113562227.png" alt="KrisJohannesen_1-1782113562227.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P class=""&gt;In short, the above error tells you that in order to query a Metric View in SQL, you need to reference each of the defined measures through MEASURE(). 
Honestly, this makes sense, since it is the same approach you need when querying the Metric View inside of Databricks' own platform.&lt;/P&gt;&lt;P class=""&gt;&lt;A href="https://www.linkedin.com/feed/update/urn:li:activity:7390475367553302528" target="_self"&gt;Here&lt;/A&gt;, I have added a short video shared by Databricks, where &lt;A class="" href="https://www.linkedin.com/article/edit/7396113425862406144/#" target="_blank" rel="noopener noreferrer nofollow"&gt;Simon Whiteley&lt;/A&gt; demonstrates exactly how a SQL Query of Metric Views actually works in practice.&lt;/P&gt;&lt;DIV class=""&gt;&lt;HR /&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Alternative approaches and workarounds&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;Alright, so since there is no native support, and it unfortunately does not exist on any roadmaps yet, I was wondering if we can do something else to actually leverage the definitions we have already created. The obvious solution is of course to lean into the Databricks native AI/BI dashboards as described in &lt;A class="" href="https://www.linkedin.com/pulse/metric-views-aibi-dashboards-genie-part-2-3-kristian-johannesen-tbhbf/" target="_blank" rel="noopener noreferrer"&gt;Part 2&lt;/A&gt;, but I also know that some companies are so heavily invested in Power BI that this would not be a viable approach, at least not in the short-term.&lt;/P&gt;&lt;P class=""&gt;So I set out to test some workarounds. I am not aiming for this to be the perfect solution, so to be honest, I have fairly low expectations of anything actually working dynamically in terms of measure evaluation. 
I know that this is honestly not very useful when compared to a DAX measure that is automatically evaluated at the granularity reflected, but let's see if we can work some magic.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Load to Power BI using SQL Query&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;While this might not be the most elegant of solutions, and you do end up losing quite a lot of the underlying logic of the Metric View itself, you do have the possibility of loading the Metric View using a SQL Query. This can be done in one of two ways, neither of which is particularly elegant:&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Create a View on top of your Metric View&lt;/STRONG&gt; (I know - this sounds dumb). In the view, the measures of the Metric View need to be referenced as MEASURE(`measure 1`) AS `measure 1`, while the dimensions can be referenced directly by name.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Create a SQL query against Databricks directly&lt;/STRONG&gt;. When setting up your Databricks connection, use a SQL query and write up the same definition as you would for the above view. Same concept, but you do not get an additional object inside of Databricks - instead you get a SQL query that lives inside your M partition.&lt;/P&gt;&lt;P class=""&gt;Both of these approaches allow you to load the columns correctly to Power BI; however, you are essentially left with a regular view, with no annotations, metadata or measure definitions.&lt;/P&gt;&lt;P class=""&gt;Neither of these approaches actually gains any benefit from having the Metric View. They would both work just as well by just using a regular View from the beginning - or even better - by loading the facts and dimensions from tables/views separately and defining your model inside of Tabular Editor (or Power BI Desktop). With the correct metadata in Unity Catalog, you would even get the Relationships defined automatically. 
Check out &lt;A href="https://tabulareditor.com/blog/tabular-editor-x-databricks-part-5" target="_self"&gt;this article&lt;/A&gt; (and the rest of the series), written by &lt;A class="" href="https://www.linkedin.com/article/edit/7396113425862406144/#" target="_blank" rel="noopener noreferrer nofollow"&gt;&lt;span class="lia-unicode-emoji" title=":skull:"&gt;💀&lt;/span&gt; Johnny Winter &lt;span class="lia-unicode-emoji" title=":skull:"&gt;💀&lt;/span&gt;&lt;/A&gt; for some great tips and tricks.&lt;/P&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;&lt;STRONG&gt;Tabular Editor's Semantic Bridge:&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;In November 2025, &lt;A class="" href="https://www.linkedin.com/article/edit/7396113425862406144/#" target="_blank" rel="noopener noreferrer nofollow"&gt;Greg Baldini&lt;/A&gt; from &lt;A class="" href="https://www.linkedin.com/article/edit/7396113425862406144/#" target="_blank" rel="noopener noreferrer nofollow"&gt;Tabular Editor&lt;/A&gt; went on the Explicit Measures Podcast to discuss their new MVP feature called the Semantic Bridge that they are working on alongside Advancing Analytics. The Semantic Bridge introduction and discussion start at around the 22:00 mark - but honestly the whole thing is really interesting.&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN&gt;This feature is meant to work across multiple semantic layers, and might be the missing link that can actually enable us to work with Metric Views in Power BI. 
While this is still in its early stages, I found the discussions around how to work across different formats and syntax really interesting and I can't wait to see what we can do with this feature in the future.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;EM&gt;&lt;SPAN&gt;There will be a Bonus Episode to this series with more details on the Semantic Bridge specifically!&lt;/SPAN&gt;&lt;/EM&gt;&lt;/P&gt;&lt;DIV class=""&gt;&lt;HR /&gt;&lt;DIV class=""&gt;&lt;SPAN&gt;Conclusions&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P class=""&gt;&lt;STRONG&gt;No Native Integration (Yet):&lt;/STRONG&gt; Power BI &lt;STRONG&gt;does not natively support Databricks Metric Views&lt;/STRONG&gt;. While the Databricks connector allows listing them from Unity Catalog, it cannot query them directly due to the required use of MEASURE() and GROUP BY logic that Power BI does not generate automatically. At this stage, Metric Views are most useful within the Databricks environment itself. Native AI/BI dashboards and tools like Genie understand and correctly interpret Metric Views with no workarounds.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;SQL Workarounds Require Redefinition:&lt;/STRONG&gt; The only functional workaround today is to write custom SQL queries in Databricks that explicitly call MEASURE() and expose the result as a &lt;STRONG&gt;view&lt;/STRONG&gt; or &lt;STRONG&gt;table&lt;/STRONG&gt;. These can then be imported into Power BI. 
However, this approach &lt;STRONG&gt;redefines the logic&lt;/STRONG&gt; of the Metric View, which weakens the promise of a single source of truth and also requires you to redefine your measures as DAX on the back of the integration.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Tabular Editor &amp;amp; Semantic Bridge: &lt;/STRONG&gt;This might be the short-term solution to cross-integration between the two platforms that allows us to translate one to the other and vice versa; however, it is still not a direct connection.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;The Community is asking for more:&lt;/STRONG&gt; There is growing demand for a native Power BI integration. If you want to support it yourself, check out this &lt;A class="" href="https://community.fabric.microsoft.com/t5/Fabric-Ideas/Enable-native-Power-BI-integration-with-Databricks-Metric-View/idi-p/4823684" target="_blank" rel="noopener noreferrer"&gt;Microsoft Fabric community idea&lt;/A&gt;.&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P class=""&gt;If you are really deep into Databricks already, it might be worth considering whether the time has come to move your Analytics and Dashboards from Power BI into Databricks. But that's a topic for another day!&lt;/P&gt;&lt;/BLOCKQUOTE&gt;</description>
      <pubDate>Mon, 22 Jun 2026 07:37:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/metric-views-with-power-bi-and-tabular-editor-part-3-of-3/m-p/160040#M1293</guid>
      <dc:creator>KrisJohannesen</dc:creator>
      <dc:date>2026-06-22T07:37:32Z</dc:date>
    </item>
    <item>
      <title>Getting Certified as a Databricks Generative AI Engineer Associate: Key Takeaways and Insights</title>
      <link>https://community.databricks.com/t5/community-articles/getting-certified-as-a-databricks-generative-ai-engineer/m-p/160026#M1292</link>
      <description>&lt;H1&gt;&lt;FONT face="times new roman,times"&gt;&lt;FONT size="4"&gt;I just earned my Databricks Certified Generative AI Engineer Associate Certification, and in this post, I’m sharing the key tips and resources, including what confused me, what actually worked, and the traps I nearly fell into.&amp;nbsp;&lt;/FONT&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="AngelShrestha_0-1782102769751.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28094i021E3521C01474D8/image-size/medium?v=v2&amp;amp;px=400" role="button" title="AngelShrestha_0-1782102769751.png" alt="AngelShrestha_0-1782102769751.png" /&gt;&lt;/span&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H1&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;FONT color="#339966"&gt;Why I Took This Exam&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;I work across building scalable ML and GenAI solutions and architecture, which means staying current on the GenAI stack is a practical requirement, not just a resume item. While working on a recent project, I started exploring Databricks more deeply, and I found a platform that has evolved from data engineering into a serious end-to-end system for building production AI applications, from data ingestion all the way to agents, monitoring, and governance.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;I'm sharing this not as a polished success story, but as an honest account of the preparation process, including the topics that genuinely confused me and what actually helped. 
I hope it's useful whether you're just starting to explore the platform or actively preparing for the exam.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H1&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;FONT color="#339966"&gt;About the Exam&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;The Databricks Certified Generative AI Engineer Associate tests the full lifecycle of building GenAI applications on Databricks; from design and data preparation through to deployment and monitoring. Approximately 56 multiple-choice questions in 90 minutes, including some unscored questions.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Domain Breakdown&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Domain&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Weight&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Focus Area&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Design Applications&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;14%&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;Prompt design, model selection&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Data Preparation&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new 
roman,times"&gt;&lt;STRONG&gt;14%&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;Chunking, embeddings, vector search&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Application Development&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;30% (heaviest)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;Agent tools, frameworks, deployment patterns&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Assembling &amp;amp; Deploying Apps&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;22%&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;MLflow, Model Serving, CI/CD, Apps&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Governance&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;8%&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;Unity Catalog, access control, lineage&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Evaluation &amp;amp; Monitoring&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;12%&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT 
face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;MLflow judges, monitoring pipelines&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H1&gt;&lt;FONT face="times new roman,times" size="3"&gt;&lt;SPAN&gt;For a detailed overview, access the complete&amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT face="times new roman,times" size="3"&gt;&lt;A href="https://www.databricks.com/sites/default/files/2025-04/databricks-certified-generative-ai-engineer-associate-guide.pdf" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;exam guide&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;H1&gt;&amp;nbsp;&lt;/H1&gt;&lt;H1&gt;&lt;FONT face="times new roman,times" color="#339966"&gt;&lt;STRONG&gt;What Actually Helped Me Prepare&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;1.&amp;nbsp; The Four Official ILT Courses&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;I completed the instructor-led track end-to-end. All four. In order. 
These are well structured, and having a live instructor to ask questions made a real difference when concepts felt confusing.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL class="lia-align-justify"&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Building Retrieval Agents on Databricks&lt;/STRONG&gt;&lt;SPAN&gt; — RAG pipelines, embeddings, Vector Search, chunking strategies, MLflow tracing for agents&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Building Single-Agent Applications&lt;/STRONG&gt;&lt;SPAN&gt; — UC function tools, LangChain integration, ResponsesAgent, MLflow logging and reproducibility, Agent Bricks&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Generative AI Application Evaluation and Governance&lt;/STRONG&gt;&lt;SPAN&gt; — MLflow judges (built-in, guideline, custom), offline vs online evaluation, the Review App, human feedback loops&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Generative AI Deployment and Monitoring&lt;/STRONG&gt;&lt;SPAN&gt; — Batch vs real-time deployment, Lakehouse Monitoring, LLMOps vs MLOps, Databricks Asset Bundles&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;The courses provide a strong mental model for building and operating GenAI applications, and the hands-on labs reinforce the concepts as you learn them.&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;You can explore and register for these courses through the &lt;/SPAN&gt;&lt;A href="https://www.databricks.com/training/catalog?levels=onboarding&amp;amp;types=led" target="_blank" rel="noopener"&gt;&lt;SPAN&gt;Databricks Training Catalog&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN&gt;. 
Some courses are free, while others are paid.&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;2. &lt;/STRONG&gt;&lt;STRONG&gt;Demo notebooks and Labs&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;I also went through the hands-on demos and labs for each module. They help you gain practical knowledge of the concepts on Databricks.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Note:&lt;/SPAN&gt;&lt;I&gt;&lt;SPAN&gt; The self-paced courses are free to access, but the &lt;/SPAN&gt;&lt;/I&gt;&lt;STRONG&gt;&lt;I&gt;demo/lab notebooks require an annual subscription&lt;/I&gt;&lt;/STRONG&gt;&lt;I&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;3.&amp;nbsp; Going Deep on the Official Documentation&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;After completing the courses, I spent time going through the documentation for each topic they covered. The docs are the most reliable source for exam-specific details and help fill in many of the gaps that the courses only touch on at a high level.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;I highly recommend reading everything in the Databricks Agents documentation:&lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/agents/?utm_source=chatgpt.com" target="_blank" rel="noopener"&gt; &lt;SPAN&gt;Databricks Agents Documentation&lt;/SPAN&gt;&lt;/A&gt;&lt;SPAN&gt;. 
It covers, in depth, much of the theory behind the concepts introduced in the training courses.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;4.&amp;nbsp; A Decision-Table Revision System (with AI)&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;This was one of the most effective things I did. I used AI, specifically Claude, as a study partner: not to get answers handed to me, but to work through concepts conversationally and then consolidate everything into a structured revision document focused on the comparison layer.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;The exam doesn't reward definitions. It rewards scenario reading: understanding which option is correct given specific constraints buried in a paragraph. Many questions include subtle details that change the correct answer.&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;Instead of creating notes like "Vector Search exists," I focused on comparison-based revision tables such as:&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL class="lia-align-justify"&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Structure-aware vs semantic vs fixed-size chunking: when each is correct and why&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Standard vs Storage-Optimized Vector Search endpoints: the multi-constraint decision&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Continuous vs triggered sync:&amp;nbsp; matched to data update cadence&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Delta Sync vs Direct CRUD:&amp;nbsp; when lineage matters vs when it 
doesn't&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Pay-per-token vs Provisioned throughput: which one lowers cost for your consumption pattern&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Batch (ai_query) vs real-time Model Serving: based on latency and use case&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Reference-free vs reference-based MLflow judges:&amp;nbsp; know which requires ground truth&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;By organizing concepts as decisions rather than definitions, I found it much easier to recognize the correct answer when presented with real-world scenarios on the exam.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H1&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;FONT color="#339966"&gt;Exam Day Tips&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;UL&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;You have enough time.&amp;nbsp;&amp;nbsp;&lt;/STRONG&gt;&lt;SPAN&gt;56 questions in 90 minutes. I finished in 77 minutes with time to review. Don't rush. Use the mark-for-review feature and do a second pass on anything uncertain.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Read the full scenario before the options. &lt;/STRONG&gt;&lt;SPAN&gt;The constraints buried in the middle of the paragraph often determine the correct answer. Options A and B may look equally plausible until you notice a latency or cost constraint you initially skipped.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Diagnose before you answer. 
&lt;/STRONG&gt;&lt;SPAN&gt;For questions describing a problem (wrong tool call order, high latency, poor retrieval), train yourself to identify which component in the pipeline is actually failing before reading the options.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Code questions are read, not write. &lt;/STRONG&gt;&lt;SPAN&gt;You might never be asked to write code from scratch. You will be asked to read a snippet and identify what is wrong, what it does, or why it behaves unexpectedly. The key skill is recognising common anti-patterns.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;The exam is more conceptual than Databricks-syntax-heavy. &lt;/STRONG&gt;&lt;SPAN&gt;General GenAI knowledge matters: hallucination types, RLHF mechanics, RAG vs fine-tuning tradeoffs. The courses assume this background. Address that gap directly if you're light on it.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H1&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;FONT color="#339966"&gt;Topics That Required Extra Attention: Personal View&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Topic 1: Chunking Strategy Selection&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;All chunking strategies sound similar until you need to choose between them under exam pressure. 
The clearest framing I found:&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Scenario Signal&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Use This&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Consistent headings or sections in the document&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Structure-aware:&amp;nbsp; boundaries already exist, use them&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;No explicit structure, prose flows naturally&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Embedding-based semantic:&amp;nbsp; detects topic shifts via similarity&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Context getting cut off at chunk boundaries&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Add 10–20% overlap: prevents split-concept retrieval failure&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Both specific and broad user questions expected&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Parent Document Retrieval: small chunks for precision, parent for context&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Approaching the embedding model's token 
limit&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Sub-chunk:&amp;nbsp; databricks-gte-large-en silently truncates at 8192 tokens (databricks-bge-large-en at 512), no error&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;The Silent Truncation Trap&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H3&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Embedding models don't error on oversized input; they silently truncate. Content beyond the token limit is simply never represented in the embedding vector. This is one of the most commonly missed details in exam questions. There's no warning, no exception, no indication anything went wrong.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Topic 2: Vector Search Configuration&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;The Standard vs Storage-Optimized decision depends on the combination of constraints given in a scenario. 
Checking only one factor leads to the wrong answer.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Choose This&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;When&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Standard endpoint&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Strict latency (&amp;lt;200ms), high QPS (100+), smaller index (&amp;lt;2M vectors)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Storage-Optimized&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Large index (10M+ vectors), cost is priority, 500ms+ latency acceptable&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Continuous sync&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Data changes in real-time or near-real-time (minutes)&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Triggered sync&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Scheduled updates:&amp;nbsp; match frequency to actual cadence&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Direct CRUD API&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Real-time vector 
insertion with no Delta table backing it&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Topic 3: Deployment Patterns and Code Anti-Patterns&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Specific things kept appearing in practice scenarios:&lt;BR /&gt;&lt;/SPAN&gt;&lt;STRONG&gt;Delta Sync vs Direct CRUD: &lt;/STRONG&gt;&lt;SPAN&gt;Delta Sync is right when your source data lives in Delta and you want full lineage, governance, and rebuild capability. Direct CRUD is right when you need real-time vector insertions without a Delta backing table.&lt;BR /&gt;&lt;/SPAN&gt;&lt;STRONG&gt;Incremental updates: &lt;/STRONG&gt;&lt;SPAN&gt;Only processing changed documents requires enabling delta.enableChangeDataFeed on your Delta table and using MERGE INTO rather than truncate-and-reload. Without this, a nightly pipeline re-processes 100,000 unchanged documents when only 200 actually changed.&lt;BR /&gt;&lt;/SPAN&gt;&lt;STRONG&gt;Critical anti-pattern: &lt;/STRONG&gt;&lt;SPAN&gt;Never put expensive initializations (database clients, model connections) inside predict() in a PyFunc model. That runs on every request. They belong in load_context(), which runs once at model load. The symptom: every request is slow, not just the first.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Topic 4: Model Selection Without Hands-On Experience&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;If you haven't worked across different model families, the exam tests tradeoffs you may never have consciously thought about. 
The ones that came up:&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;UL class="lia-align-justify"&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Latency vs quality: &lt;/STRONG&gt;&lt;SPAN&gt;A 7B model at 150ms may be the only viable choice over a higher-accuracy 34B model at 1,800ms when the SLA is 200ms. A better benchmark score is irrelevant if the model can't meet the constraint.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Multilingual requirements: &lt;/STRONG&gt;&lt;SPAN&gt;English-only embedding models (databricks-gte-large-en, bge-large-en) produce poor embeddings for non-English content regardless of quality. Multilingual scenario = multilingual model.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Tool-calling capability: &lt;/STRONG&gt;&lt;SPAN&gt;Not all LLMs support function/tool calling. If a model never calls tools during testing, this is the most likely explanation.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Task-specific fit: &lt;/STRONG&gt;&lt;SPAN&gt;A narrow fixed-category classification task at high volume (40,000 daily requests) is better served by a small fine-tuned classifier than by a large general-purpose LLM, on both latency and cost per inference.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Evaluation metrics by task: &lt;/STRONG&gt;&lt;SPAN&gt;HumanEval for code generation, BLEU for translation, ROUGE for summarization, domain-specific benchmarks for everything else. 
Highest overall score ≠ best fit for your task.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Topic 5: AI Gateway: Three Features, Three Jobs&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;These are likely to appear on the exam, and the three features are easy to conflate. Know exactly which one solves which problem:&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Feature&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Solves&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Inference Tables&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Full audit trail: complete request/response payload per interaction, queryable by timestamp&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Usage Tables&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Cost attribution: aggregated token consumption by team/endpoint for chargeback&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Rate Limiting&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Enforcement: cap requests per user or service principal regardless of which app is calling&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" 
color="#666699"&gt;&lt;STRONG&gt;Topic 6: Evaluation Judges: Ground Truth Requirements&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;This distinction comes up directly in exam questions. Know it cold before exam day:&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Judge&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Needs Ground Truth?&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Notes&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Correctness&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;✓ YES&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;needs expectations field&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;RetrievalSufficiency&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;✓ YES&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;needs expectations field&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;RelevanceToQuery&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;✗ NO&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new 
roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;reference-free&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;RetrievalGroundedness&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;✗ NO&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;reference-free&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;RetrievalRelevance&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;✗ NO&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;reference-free&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Safety&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;✗ NO&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;reference-free&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H2&gt;&lt;FONT face="times new roman,times" color="#666699"&gt;&lt;STRONG&gt;Topic 7: The Monitoring Pipeline: Understand Why, Not Just What&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;The sequence is:&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;Inference Table&amp;nbsp; →&amp;nbsp; Structured Streaming (unpack raw JSON)&amp;nbsp; →&amp;nbsp; processed Delta 
table (CDF enabled)&amp;nbsp; →&amp;nbsp; Lakehouse Monitor (Time Series profile)&amp;nbsp; →&amp;nbsp; profile and drift metrics tables&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Understanding why each step exists matters more than memorising the sequence. You can't run meaningful monitoring directly on the raw inference table because request/response payloads are stored as opaque JSON strings; monitoring them computes statistics on string length, not on actual semantic content. Unpacking first gives you toxicity scores, response length distributions, and anything semantically meaningful.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;H2&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;&lt;FONT color="#666699"&gt;Topic 8: Agent Bricks: Knowing When NOT to Use Them&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H2&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Agent Brick Type&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;STRONG&gt;Right Scenario&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Knowledge Assistant&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;RAG over documents with citations. No ML expertise needed. 
Fast time-to-production.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Information Extraction&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;High-volume extraction of structured fields from unstructured text into a Delta table.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Multi-Agent Supervisor&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Routing between structured (Genie/SQL) and unstructured (RAG) sources. Can also run as a single agent with just a toolkit.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Custom LLM&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Strict tone, format, or compliance requirements baked into the model, not just a system prompt.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P class="lia-align-justify"&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&amp;nbsp;&lt;/TD&gt;&lt;TD&gt;&lt;P&gt;&lt;FONT face="times new roman,times"&gt;&lt;I&gt;&lt;SPAN&gt;Exam trap: If an agent already exists and is working, don't rebuild it with Agent Bricks. Extend it. 
Agent Bricks is for starting from scratch when the use case fits a known pattern.&lt;/SPAN&gt;&lt;/I&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H1&gt;&lt;FONT face="times new roman,times" color="#339966"&gt;&lt;STRONG&gt;Final Thoughts&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/H1&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;This certification covers material that maps directly to real production work. The preparation process pushed me to understand not just what each Databricks tool does, but when to choose it over the alternatives, which is the thinking that actually matters when designing real systems.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-justify"&gt;&lt;FONT face="times new roman,times"&gt;&lt;SPAN&gt;Go beyond the courses. Build your own comparison-layer reference. Pay close attention to the 'when to use what' questions. That's where this exam lives. &lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;I'm confident I'll be applying these skills right away in the solutions and architectures I design. It's one of those certifications where the knowledge gained has immediate practical value and translates directly into real-world impact.&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P class="lia-align-justify"&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 04:55:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/getting-certified-as-a-databricks-generative-ai-engineer/m-p/160026#M1292</guid>
      <dc:creator>AngelShrestha</dc:creator>
      <dc:date>2026-06-22T04:55:05Z</dc:date>
    </item>
    <item>
      <title>200,000 strong and just getting started. My Data and AI Summit 2026</title>
      <link>https://community.databricks.com/t5/community-articles/200-000-strong-and-just-getting-started-my-data-and-ai-summit/m-p/159926#M1291</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Brahmareddy_0-1781896039284.png" style="width: 830px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/28063iB5D6BF02AD7652DD/image-dimensions/830x467?v=v2" width="830" height="467" role="button" title="Brahmareddy_0-1781896039284.png" alt="Brahmareddy_0-1781896039284.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P class=""&gt;Just back from Data and AI Summit 2026 and I am still buzzing.&lt;/P&gt;&lt;P class=""&gt;This was my best summit yet. Moscone was packed. More than 30,000 of us in one place, from over 150 countries, all there for the same reason. To build better with data and AI.&lt;/P&gt;&lt;P class=""&gt;The keynotes set the tone. Ali, Matei, Arsalan and Reynold on the main stage. Satya Nadella and Greg Brockman joining in. The message was clear. The era of agents is here, and it runs on three things. Context, control and choice.&lt;/P&gt;&lt;P class=""&gt;That hit home for me. I have been saying for a while that agents are only as good as the data substrate underneath them. Good data engineering is not a nice to have anymore. It is the precondition for autonomy. This summit made that real.&lt;/P&gt;&lt;P class=""&gt;The product news backed it up. Lakebase is now doing 12 million database launches a day. Agent Bricks crossed 100,000 agents built and is processing more than a quadrillion tokens a year. Genie is moving from a chat box to a real coworker. And Free Edition, which I use every day, now ships Genie Code, serverless GPUs, Lakebase, Agent Bricks and Lakeflow Designer. The full toolkit, no cost. That last one matters a lot to me. Most of my POCs run on Free Edition.&lt;/P&gt;&lt;P class=""&gt;The whole week felt like the platform moving in the same direction I have been writing about. Context first. Apps on top. 
Governance around it.&lt;/P&gt;&lt;P class=""&gt;What I am most proud of, though, is the community.&lt;/P&gt;&lt;P class=""&gt;We crossed 200,000 members. Let that sink in. 200,000 practitioners helping each other ship real work. None of this happens by accident. The community team carries it.&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/174591"&gt;@MandyR&lt;/a&gt;,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/152834"&gt;@Advika&lt;/a&gt;, and&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/5"&gt;@Sujitha&lt;/a&gt;, thank you. You lead with care and you make this place feel like home for every new person who walks in.&lt;/P&gt;&lt;P class=""&gt;I also met many Databricks leaders this week. The hallway conversations were as valuable as the sessions. Their inputs gave me a lot to think about and a lot to build.&lt;/P&gt;&lt;P class=""&gt;Being a Community Champion is a privilege I do not take lightly. My goal is simple. Help other members do their work better and grow their careers through my community posts, through POCs, and through the Databricks developer community. We grow when we lift each other.&lt;/P&gt;&lt;P class=""&gt;Databricks is not just keeping pace in data and AI. It is setting the direction. And we get to build on top of it together.&lt;/P&gt;&lt;P class=""&gt;Already counting down to the next one. Let us make this community even more impactful.&lt;/P&gt;&lt;P class=""&gt;Who else was there? Tell me your top moment.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jun 2026 19:10:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/200-000-strong-and-just-getting-started-my-data-and-ai-summit/m-p/159926#M1291</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2026-06-19T19:10:36Z</dc:date>
    </item>
    <item>
      <title>How a Partitioning Mistake Turned a 12-Minute Databricks Job Into a 2-Hour Nightmare</title>
      <link>https://community.databricks.com/t5/community-articles/how-a-partitioning-mistake-turned-a-12-minute-databricks-job/m-p/159918#M1290</link>
      <description>&lt;P&gt;Hello Databricks Community!&lt;/P&gt;&lt;P&gt;I recently published a detailed breakdown on Medium about a real-world optimization nightmare we faced, and I wanted to share the core lessons learned with this group.&lt;/P&gt;&lt;P&gt;We had a highly efficient Delta table pipeline handling &lt;STRONG&gt;1.2 billion records&lt;/STRONG&gt; that completed its hourly incremental updates in just &lt;STRONG&gt;12 minutes&lt;/STRONG&gt;. In a bid to speed up specific queries, we made a seemingly logical choice: partitioning the table by a high-cardinality column (TransactionID).&lt;/P&gt;&lt;P&gt;Instead of speeding things up, this single layout choice turned our 12-minute job into a &lt;STRONG&gt;2-hour nightmare&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;The root cause? A catastrophic &lt;STRONG&gt;small file explosion&lt;/STRONG&gt; (creating 2.7 million partitions and 3.2 million tiny files) that completely drowned Spark in metadata overhead. Upgrading cluster sizes, running standard OPTIMIZE, and trying ZORDER barely made a dent because Spark was spending all its time just navigating physical directories.&lt;/P&gt;&lt;P&gt;We ultimately solved this by migrating completely to &lt;STRONG&gt;Delta Lake's Liquid Clustering&lt;/STRONG&gt;, which slashed our file count down to 18,000, removed directory overhead entirely, and dropped our total pipeline runtime down to just &lt;STRONG&gt;8 minutes&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;I've shared the full, step-by-step optimization journey, including our exact benchmarking numbers for each failed attempt, over on Medium.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":backhand_index_pointing_right:"&gt;👉&lt;/span&gt; &lt;STRONG&gt;&lt;A href="https://medium.com/@avinash.narala6814/how-a-databricks-partitioning-mistake-turned-a-12-minute-databricks-job-into-a-2-hour-nightmare-d51765126d29" target="_blank" rel="noopener"&gt;Read the Full Story on 
Medium&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;H3&gt;The Results&lt;/H3&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Metric&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Traditional Partitioning&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;SPAN&gt;&lt;STRONG&gt;Total Files&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD&gt;&lt;SPAN&gt;3.2 Million&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD&gt;&lt;SPAN&gt;&lt;STRONG&gt;18,000&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;SPAN&gt;&lt;STRONG&gt;Partition Directories&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD&gt;&lt;SPAN&gt;2.7 Million&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD&gt;&lt;SPAN&gt;&lt;STRONG&gt;0&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;SPAN&gt;&lt;STRONG&gt;Pipeline Runtime&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD&gt;&lt;SPAN&gt;~120 minutes&lt;/SPAN&gt;&lt;/TD&gt;&lt;TD&gt;&lt;SPAN&gt;&lt;STRONG&gt;8 minutes&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;H3&gt;Key Takeaway&lt;/H3&gt;&lt;P&gt;The old rule of &lt;I&gt;"partition by the column you filter on"&lt;/I&gt; fails spectacularly on high-cardinality keys like IDs. If you are facing massive metadata overhead or slow merges, skip the cluster upgrades and switch to &lt;STRONG&gt;Liquid Clustering&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;I&gt;Have you run into similar small-file bottlenecks in your production environment? Let's discuss below!&lt;/I&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jun 2026 16:13:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/how-a-partitioning-mistake-turned-a-12-minute-databricks-job/m-p/159918#M1290</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2026-06-19T16:13:36Z</dc:date>
    </item>
    <item>
      <title>Foresight — The Third Temporal Dimension of Delta Lake</title>
      <link>https://community.databricks.com/t5/community-articles/foresight-the-third-temporal-dimension-of-delta-lake/m-p/159916#M1289</link>
      <description>&lt;P&gt;Delta Lake gives you time travel backward:&lt;/P&gt;&lt;PRE&gt;SELECT * FROM sales TIMESTAMP AS OF '2026-01-01'&lt;/PRE&gt;&lt;P&gt;But what about forward? What will your Delta table look like next month?&lt;/P&gt;&lt;P&gt;Nobody has built probabilistic future queries as a first-class Delta concept until now.&lt;/P&gt;&lt;HR /&gt;&lt;P&gt;&lt;STRONG&gt;Introducing Delta Foresight&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;GitHub: &lt;A href="https://github.com/HarshalSant/delta-foresight" rel="noopener noreferrer" target="_blank"&gt;https://github.com/HarshalSant/delta-foresight&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Install:&lt;/P&gt;&lt;PRE&gt;pip install delta-foresight&lt;/PRE&gt;&lt;PRE&gt;from delta_foresight import DeltaForesight

df = DeltaForesight(table="catalog.schema.daily_sales", time_column="sale_date", spark=spark)
df.fit()
forecast = df.predict(as_of="2026-09-01", confidence=0.90)
df.materialize("catalog.delta_foresight.daily_sales_forecast")&lt;/PRE&gt;&lt;P&gt;Then query it with SQL:&lt;/P&gt;&lt;PRE&gt;SELECT ds, revenue_forecast, revenue_lower_90, revenue_upper_90
FROM catalog.delta_foresight.daily_sales_forecast&lt;/PRE&gt;&lt;HR /&gt;&lt;P&gt;&lt;STRONG&gt;What Makes It Different&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;The forecast IS a Delta table, not a report or an export. Governed by Unity Catalog, queryable with SQL, shareable via Delta Sharing.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Mathematically valid prediction intervals: uses conformal prediction with proven coverage guarantees. Ask for 90%, get 90%.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Learns your table's temporal DNA: auto-detects frequency, trend, and seasonality from your own Delta history. 
No manual setup.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;MLflow tracking built in: every forecast run is logged automatically.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Works inside and outside Databricks: PySpark on Databricks, delta-rs locally, Parquet fallback for dev.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;HR /&gt;&lt;P&gt;&lt;STRONG&gt;Use Cases&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Revenue planning: predict month-end close&lt;/LI&gt;&lt;LI&gt;ML model health: forecast when AUC will breach threshold&lt;/LI&gt;&lt;LI&gt;Inventory: predict stock depletion date&lt;/LI&gt;&lt;LI&gt;Cost management: forecast DBU burn rate&lt;/LI&gt;&lt;LI&gt;Data quality: forecast null rate trajectory&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;P&gt;&lt;STRONG&gt;CLI&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;foresight predict --table catalog.schema.daily_sales --as-of 2026-09-01
foresight fingerprint --table catalog.schema.daily_sales
foresight serve  # REST API at localhost:8080/docs&lt;/PRE&gt;&lt;HR /&gt;&lt;P&gt;Feedback welcome: which use case matters most to your team? GitHub Issues: &lt;A href="https://github.com/HarshalSant/delta-foresight/issues" rel="noopener noreferrer" target="_blank"&gt;https://github.com/HarshalSant/delta-foresight/issues&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Also author of vigil-ml:&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.linkedin.com/in/harshalsant0" target="_blank"&gt;https://www.linkedin.com/in/harshalsant0&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/HarshalSant/vigil" rel="noopener noreferrer" target="_blank"&gt;https://github.com/HarshalSant/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jun 2026 15:52:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/foresight-the-third-temporal-dimension-of-delta-lake/m-p/159916#M1289</guid>
      <dc:creator>harsh0610</dc:creator>
      <dc:date>2026-06-19T15:52:49Z</dc:date>
    </item>
    <item>
      <title>DAIS 2026: The Databricks Announcements I Think Clients Should Pay Attention To</title>
      <link>https://community.databricks.com/t5/community-articles/dais-2026-the-databricks-announcements-i-think-clients-should/m-p/159914#M1288</link>
      <description>&lt;P class=""&gt;The most important thing I took away from Data + AI Summit 2026 was not one product announcement.&lt;/P&gt;&lt;P class=""&gt;It was the direction.&lt;/P&gt;&lt;P class=""&gt;Databricks is building around a very real enterprise problem: companies want AI to help with decisions, operations, customer engagement, software development, security, and analytics, but the AI has to work inside the reality of the business.&lt;/P&gt;&lt;P class=""&gt;That reality includes messy data, strict permissions, different definitions of the same metric, pipelines that break, models that drift, sensitive customer data, many clouds, many tools, and teams that already have enough platforms to manage.&lt;/P&gt;&lt;P class=""&gt;This is why I found this year’s announcements interesting. They were not only about adding more capability. They were about reducing the distance between data, context, AI, governance, and action.&lt;/P&gt;&lt;H3 id="ember353"&gt;Context is becoming the real AI foundation&lt;/H3&gt;&lt;P class=""&gt;The announcement around &lt;STRONG&gt;Genie One, Genie Agents, and Genie Ontology&lt;/STRONG&gt; was one of the strongest signals from the summit.&lt;/P&gt;&lt;P class=""&gt;The reason is simple. A business user does not need another generic chatbot. They need an AI experience that understands how their company works.&lt;/P&gt;&lt;P class=""&gt;In most organizations, the business meaning of data is spread across dashboards, SQL queries, notebooks, pipelines, documents, wikis, tickets, and team knowledge. A table may be accurate, but the real definition of the metric may live somewhere else. A dashboard may be popular, but not always certified. A calculation may be used in production, but not documented clearly.&lt;/P&gt;&lt;P class=""&gt;This is the gap Genie Ontology is trying to close.&lt;/P&gt;&lt;P class=""&gt;The interesting part is not only that Genie can answer questions. 
The interesting part is that Genie can use business context, source authority, freshness, usage, relationships, permissions, and trusted definitions to decide how to answer. That is the difference between an AI answer that sounds right and an AI answer that the business can trust.&lt;/P&gt;&lt;P class=""&gt;Genie One then puts that experience where people work: data, apps, Slack, Teams, mobile, MCP-based experiences, and agent workflows. Genie Agents extend it further by letting teams create domain-specific agents grounded in the same trusted context.&lt;/P&gt;&lt;P class=""&gt;For clients, this is a major point. AI accuracy will not come only from better models. It will come from giving the model the right business context, close to the governed data.&lt;/P&gt;&lt;H3 id="ember361"&gt;Agent engineering is becoming a platform problem&lt;/H3&gt;&lt;P class=""&gt;Agent Bricks and Omnigent were also important to me because they address what many teams are starting to learn.&lt;/P&gt;&lt;P class=""&gt;Building an agent demo is easy. Running agents safely at enterprise scale is not.&lt;/P&gt;&lt;P class=""&gt;Databricks made a very useful point in the Agent Bricks announcement: the core agent loop is only a small part of the work. The hard parts are token capacity, deployment, security, evaluation, monitoring, context, sharing, memory, cost control, and safe execution.&lt;/P&gt;&lt;P class=""&gt;That matches what I see with clients. The excitement around agents is real, but the operating model is still immature. Teams are using different coding agents, different models, different harnesses, different prompts, and different security patterns. That works for experimentation. It does not scale cleanly.&lt;/P&gt;&lt;P class=""&gt;This is where &lt;STRONG&gt;Agent Bricks&lt;/STRONG&gt; becomes relevant. 
It is moving from agent building into a broader agent platform, with model choice, secure sandboxes, memory, skills, MCP support, evaluation, governance, and token controls.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Omnigent&lt;/STRONG&gt; is also a smart move. Enterprises are not going to use only one coding assistant or one framework. They will use Claude Code, Codex, custom agents, internal tools, and new tools that are not even popular yet. A meta-harness gives teams a way to compose, control, and share agent workflows without locking everything to one tool.&lt;/P&gt;&lt;P class=""&gt;The managed Omnigent direction on Databricks is especially practical: shared history, remote access, collaboration, isolated execution, and governance through Unity AI Gateway.&lt;/P&gt;&lt;P class=""&gt;My view is that agent development is about to look more like software engineering and platform engineering. The teams that treat agents only as prompts will struggle. The teams that treat agents as governed systems will move faster and with less risk.&lt;/P&gt;&lt;H3 id="ember370"&gt;ZeroOps is one of the most practical announcements&lt;/H3&gt;&lt;P class=""&gt;I liked &lt;STRONG&gt;Genie ZeroOps&lt;/STRONG&gt; because it is close to the daily pain of data and ML teams.&lt;/P&gt;&lt;P class=""&gt;Anyone who has worked on a production data platform knows this pattern. A pipeline fails. A schema changes. A table looks fine but the data quality has silently changed. A dashboard number moves and nobody immediately knows whether it is a real business change or a data issue. A model starts producing weaker predictions without throwing an error.&lt;/P&gt;&lt;P class=""&gt;A general coding agent can help write code, but data and AI operations need more than code. 
They need lineage, logs, telemetry, platform events, data quality signals, job history, permissions, and safe validation against real data.&lt;/P&gt;&lt;P class=""&gt;That is why the ZeroOps flow is useful: detect, assess, remediate, and verify.&lt;/P&gt;&lt;P class=""&gt;The verify step is the part I care about most. Proposed fixes can be tested in a secure sandbox using zero-copy clones, scoped permissions, and isolation before anything touches production. That is a practical enterprise pattern. It keeps people in control while cutting down the time spent on investigation and root-cause analysis.&lt;/P&gt;&lt;P class=""&gt;For ML, this becomes even more important. A model can be technically “up” and still be wrong. Genie ZeroOps for ML can help investigate drift, serving errors, pipeline problems, and production performance issues. As more teams use AI to build more models and pipelines, this operational layer becomes necessary.&lt;/P&gt;&lt;H3 id="ember377"&gt;Real-time is moving closer to the lakehouse&lt;/H3&gt;&lt;P class=""&gt;&lt;STRONG&gt;Lakehouse//RT&lt;/STRONG&gt;, &lt;STRONG&gt;Lakebase&lt;/STRONG&gt;, &lt;STRONG&gt;Lakeflow&lt;/STRONG&gt;, and &lt;STRONG&gt;LTAP&lt;/STRONG&gt; all connect to a long-running architecture issue.&lt;/P&gt;&lt;P class=""&gt;Many companies still use separate systems for transactions, analytics, streaming, serving, applications, and AI. This creates copies of data, sync jobs, governance gaps, and additional places where things can fail.&lt;/P&gt;&lt;P class=""&gt;Lakehouse//RT is Databricks’ answer for real-time operational analytics, BI, app serving, and observability workloads directly on the lakehouse. The message I liked from the Lakehouse//RT announcement is that separate serving layers have a real cost: duplication, governance drift, and engineering overhead.&lt;/P&gt;&lt;P class=""&gt;Lakehouse//RT, powered by Reyden, is aimed at millisecond performance without moving data away from the lakehouse. 
The benchmark numbers are impressive, but the architecture point is more important to me. If teams can serve real-time apps, dashboards, and agent workflows from the same governed data foundation, they reduce a lot of unnecessary complexity.&lt;/P&gt;&lt;P class=""&gt;LTAP goes in the same direction. Lakebase supports transactional workloads. Lakeflow supports ingestion, transformation, orchestration, and pipeline development. Together, they bring transactional and analytical processing closer to the governed lakehouse.&lt;/P&gt;&lt;P class=""&gt;This is very relevant for AI. Agents need current data. Customer experiences need current data. Fraud, supply chain, finance, security, and operations use cases need current data. If data is delayed or copied across too many systems, AI becomes less useful and harder to trust.&lt;/P&gt;&lt;H3 id="ember384"&gt;Governance has moved into the AI runtime&lt;/H3&gt;&lt;P class=""&gt;The Unity Catalog and Unity AI Gateway announcements may be less flashy than agents, but they are extremely important.&lt;/P&gt;&lt;P class=""&gt;Governance is changing. It is no longer only about who can query a table or access a dashboard. Agents can call tools, invoke MCP servers, write code, generate artifacts, trigger workflows, and act across systems. That means governance has to follow the AI interaction itself.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Unity AI Gateway&lt;/STRONG&gt; is important because it extends governance into models, agents, MCP services, skills, tools, cost controls, routing, monitoring, and runtime policy enforcement.&lt;/P&gt;&lt;P class=""&gt;The partner ecosystem around Unity AI Gateway also matters. Databricks is integrating with AI security, identity, observability, DLP, runtime guardrail, and agent governance providers. That is important because large companies already have security and identity tools. 
AI governance cannot live in a separate island.&lt;/P&gt;&lt;P class=""&gt;I also paid attention to the security and compliance announcements: Automatic Identity Management for Entra ID, Okta support in preview, Context-Based Ingress, Private Network Gateway, Lakebase private connectivity, HITRUST across clouds, expanded GovCloud support, and FedRAMP High support coming on Azure Commercial.&lt;/P&gt;&lt;P class=""&gt;This is the work that makes AI usable in regulated environments. It may not get the loudest applause, but clients will care about it when they move from pilots to production.&lt;/P&gt;&lt;H3 id="ember391"&gt;Apps, Marketplace, and OpenSharing show a broader ecosystem play&lt;/H3&gt;&lt;P class=""&gt;The Apps, Marketplace, and OpenSharing announcements were also meaningful.&lt;/P&gt;&lt;P class=""&gt;Databricks Apps is becoming more important because many useful enterprise solutions are small and very specific: an operations portal, a workflow manager, a data quality review app, an internal AI assistant, a model interface, or a business process app. These apps often get delayed because of infrastructure, cost, security review, or unclear ownership.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;App Spaces&lt;/STRONG&gt; gives admins a way to define access, resources, API scopes, and security policies for groups of apps. &lt;STRONG&gt;Genie App Builder&lt;/STRONG&gt; helps teams build apps with awareness of Databricks data, Unity Catalog semantics, and workspace context. &lt;STRONG&gt;Serverless Micro Apps&lt;/STRONG&gt; make the economics better for apps that are useful but not always running.&lt;/P&gt;&lt;P class=""&gt;This is a good pattern: let the people closest to the business problem build, but do it inside a governed boundary.&lt;/P&gt;&lt;P class=""&gt;Marketplace and OpenSharing extend this to partners and data providers.&lt;/P&gt;&lt;P class=""&gt;The Marketplace commit drawdown and upcoming transactability are important for commercial adoption. 
Partners can reach Databricks customers more directly and shorten sales cycles by using pre-committed spend. Apps and Genie Agents can also be distributed through Marketplace, which opens new packaging models.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;OpenSharing&lt;/STRONG&gt; is the larger architecture move. Delta Sharing was about open, zero-copy data sharing. OpenSharing extends that idea to the agentic era: structured data, unstructured data, models, skills, semantics, and Genie Agents across clouds, platforms, and organizations.&lt;/P&gt;&lt;P class=""&gt;The ability to share a Genie Agent is very interesting. A provider can share an AI experience over their data without forcing the customer to learn the schema, build a UI, or access every underlying table. That can change how proprietary data providers package their value.&lt;/P&gt;&lt;P class=""&gt;This is where “pay per question” becomes more than a marketing idea. A data provider could let customers ask governed natural-language questions against proprietary data, with limits on prompts, rows, and access. That is a very different commercial model from traditional data licensing.&lt;/P&gt;&lt;H3 id="ember401"&gt;CustomerLake is a good example of AI moving into business workflows&lt;/H3&gt;&lt;P class=""&gt;&lt;STRONG&gt;CustomerLake&lt;/STRONG&gt; caught my attention because it shows how Databricks is moving closer to business functions, not only technical teams.&lt;/P&gt;&lt;P class=""&gt;Customer data is one of the hardest areas in any company. It is sensitive, duplicated, fragmented, and constantly changing. Traditional CDPs helped marketers activate customer data, but they often created another platform outside the governed data foundation.&lt;/P&gt;&lt;P class=""&gt;CustomerLake takes a different approach by embedding the CDP into Databricks.&lt;/P&gt;&lt;P class=""&gt;The idea of &lt;STRONG&gt;Golden Context&lt;/STRONG&gt; is important. A customer profile is useful, but it is not enough. 
The AI also needs business goals, live signals, channel context, past decisions, and what has already been tried with that customer.&lt;/P&gt;&lt;P class=""&gt;The idea of &lt;STRONG&gt;Infinity Campaigns&lt;/STRONG&gt; is also interesting. Instead of static campaigns and large segments, the direction is always-on, real-time, 1:1 engagement where agents help adapt timing, message, and channel based on current context.&lt;/P&gt;&lt;P class=""&gt;This will be a major discussion for marketing, customer experience, and data teams. The CDP conversation is moving closer to the data foundation, governance model, and AI architecture.&lt;/P&gt;&lt;H3 id="ember408"&gt;ML is becoming more native to the platform&lt;/H3&gt;&lt;P class=""&gt;The AI Platform announcements also had a strong practical angle.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Genie Code for ML&lt;/STRONG&gt; is useful because ML work is not only writing Python. It includes feature engineering, experiment tracking, evaluation, model registration, deployment, serving, monitoring, drift analysis, and retraining. A generic coding agent will not understand the full ML lifecycle unless it is connected to the platform context.&lt;/P&gt;&lt;P class=""&gt;Genie Code integrates with Unity Catalog, Feature Store, MLflow, AI Runtime, Model Serving, Inference Tables, and production observability. That context is what makes it more useful for ML teams.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;AI Runtime&lt;/STRONG&gt; is another important step. Serverless A10 and H100 GPUs, multinode training, Lakeflow Jobs support, MLflow observability, and Unity Catalog governance help teams train and fine-tune models without spending so much time on GPU infrastructure.&lt;/P&gt;&lt;P class=""&gt;The real-time ML announcements also matter: streaming features, declarative feature engineering, online feature serving on Lakebase, and high-QPS Model Serving. 
These are the capabilities that support fraud detection, recommendations, personalization, search, and other low-latency production use cases.&lt;/P&gt;&lt;P class=""&gt;The platform direction is clear: ML should not feel like a separate stack attached to the lakehouse. It should be part of the same governed system.&lt;/P&gt;&lt;H3 id="ember415"&gt;My takeaway&lt;/H3&gt;&lt;P class=""&gt;DAIS 2026 showed Databricks moving toward a more complete operating foundation for enterprise AI.&lt;/P&gt;&lt;P class=""&gt;The common thread I saw across the announcements was this:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Bring context closer to data.&lt;/LI&gt;&lt;LI&gt;Bring governance closer to AI behavior.&lt;/LI&gt;&lt;LI&gt;Bring real-time and transactional workloads closer to the lakehouse.&lt;/LI&gt;&lt;LI&gt;Bring agents closer to safe engineering practices.&lt;/LI&gt;&lt;LI&gt;Bring apps and business workflows closer to the governed data foundation.&lt;/LI&gt;&lt;LI&gt;Bring operations closer to automation, but keep humans in control where it matters.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;That is a very practical direction.&lt;/P&gt;&lt;P class=""&gt;For clients, the next phase of AI will not be won by creating more disconnected AI agents. It will be won by building a foundation where AI can understand the business, access trusted data, follow governance, work with current context, and take action safely.&lt;/P&gt;&lt;P class=""&gt;That is why I think the DAIS 2026 announcements are worth paying attention to.&lt;/P&gt;&lt;P class=""&gt;They show Databricks moving closer to how real companies actually need AI to work.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Jun 2026 15:49:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/dais-2026-the-databricks-announcements-i-think-clients-should/m-p/159914#M1288</guid>
      <dc:creator>mou</dc:creator>
      <dc:date>2026-06-19T15:49:14Z</dc:date>
    </item>
    <item>
      <title>Evaluating GenAI Applications the Right Way in Databricks Ecosystem</title>
      <link>https://community.databricks.com/t5/community-articles/evaluating-genai-applications-the-right-way-in-databricks/m-p/158589#M1286</link>
      <description>&lt;P&gt;Hello Everyone!&amp;nbsp;&lt;/P&gt;&lt;P&gt;I've been spending a lot of time lately thinking about something that keeps coming up in almost every GenAI project I touch — how do you actually know if your model is working well? Not just in demos, but in production, day after day.&lt;/P&gt;&lt;P&gt;So I sat down and jotted down some of my learnings around effective model evaluation techniques for GenAI applications using the Databricks ecosystem. What does good evaluation actually look like? Why do your old ML metrics (Precision, Recall, MAE, MAPE) still matter more than you think? And how do you build a continuous eval loop that catches problems before your users do?&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/@vinu2433/evaluating-genai-applications-the-right-way-4def3276018e?postPublishedType=initial" target="_blank"&gt;https://medium.com/@vinu2433/evaluating-genai-applications-the-right-way-4def3276018e?postPublishedType=initial&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This blog walks through the full evaluation stack — from classification and regression metrics on your retrieval and extraction layers, all the way to LLM-as-a-judge and RAG-specific metrics like faithfulness and context recall — with real Databricks code and MLflow integration throughout.&lt;/P&gt;&lt;P&gt;In upcoming posts, we'll go deeper into prompt engineering strategies, production monitoring patterns, and building eval pipelines at scale on Databricks. Stay tuned, and I'd love to hear how your teams are approaching evals in the comments below!&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jun 2026 05:12:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/evaluating-genai-applications-the-right-way-in-databricks/m-p/158589#M1286</guid>
      <dc:creator>vinaychavan</dc:creator>
      <dc:date>2026-06-09T05:12:05Z</dc:date>
    </item>
    <item>
      <title>What Happens When a Data Platform Starts Making Its Own Decisions?</title>
      <link>https://community.databricks.com/t5/community-articles/what-happens-when-a-data-platform-starts-making-its-own/m-p/158605#M1285</link>
      <description>&lt;P class=""&gt;Data platforms are getting smarter, but are we asking the right questions about what that means for data engineering?&lt;/P&gt;&lt;P class=""&gt;I wrote about how Databricks Predictive Optimization is shifting the role of data engineers from reactive maintenance to autonomous operations. The article covers:&lt;/P&gt;&lt;P class=""&gt;&lt;span class="lia-unicode-emoji" title=":small_blue_diamond:"&gt;🔹&lt;/span&gt; Why optimization becomes a visibility problem at enterprise scale&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":small_blue_diamond:"&gt;🔹&lt;/span&gt; How Predictive Optimization actually works under the hood&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":small_blue_diamond:"&gt;🔹&lt;/span&gt; Z-Ordering vs Liquid Clustering, and how Predictive Optimization handles each differently&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":small_blue_diamond:"&gt;🔹&lt;/span&gt; Where automation wins, and where engineering judgment still matters&lt;/P&gt;&lt;P class=""&gt;The question I keep coming back to: &lt;STRONG&gt;how much of data engineering optimization will remain manual five years from now?&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;Would love to hear how others in the community are approaching this, especially those who have already enabled Predictive Optimization at scale.&lt;/P&gt;&lt;P class=""&gt;&lt;span class="lia-unicode-emoji" title=":backhand_index_pointing_right:"&gt;👉&lt;/span&gt; &lt;A href="https://medium.com/p/37a11392d2af?postPublishedType=initial" target="_blank" rel="noopener"&gt;https://medium.com/p/37a11392d2af?postPublishedType=initial&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jun 2026 08:25:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/what-happens-when-a-data-platform-starts-making-its-own/m-p/158605#M1285</guid>
      <dc:creator>balajiselvarasu</dc:creator>
      <dc:date>2026-06-09T08:25:42Z</dc:date>
    </item>
    <item>
      <title>Z-Ordering VS Liquid Clustering</title>
      <link>https://community.databricks.com/t5/community-articles/z-ordering-vs-liquid-clustering/m-p/158596#M1284</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I recently published a technical blog on Z-Ordering vs Liquid Clustering in Delta Lake, covering the internals of both techniques in detail.&lt;/P&gt;&lt;P&gt;Rather than focusing only on the syntax, the blog goes deeper into:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The origin of Z-Ordering, tracing back to Morton's space-filling curve from 1966&lt;/LI&gt;&lt;LI&gt;How bit interleaving works and how Z-values are computed&lt;/LI&gt;&lt;LI&gt;Why Z-Ordering degrades over time on live, continuously loaded tables&lt;/LI&gt;&lt;LI&gt;How Liquid Clustering addresses this using the Hilbert Curve&lt;/LI&gt;&lt;LI&gt;What CLUSTER BY AUTO does internally: query pattern tracking and frequency scoring&lt;/LI&gt;&lt;LI&gt;Why partitioning and Liquid Clustering are incompatible per the official documentation&lt;/LI&gt;&lt;LI&gt;Practical recommendations from real Supply Chain ATP implementations&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The blog is built around a Supply Chain Available to Promise scenario to keep the concepts grounded in a real-world context.&lt;/P&gt;&lt;P&gt;Would appreciate any feedback or thoughts from the community.&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/@pmanoj0104/z-ordering-vs-liquid-clustering-a79a12ad0038" target="_blank" rel="noopener"&gt;https://medium.com/@pmanoj0104/z-ordering-vs-liquid-clustering-a79a12ad0038&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jun 2026 07:48:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/z-ordering-vs-liquid-clustering/m-p/158596#M1284</guid>
      <dc:creator>ManojSampath</dc:creator>
      <dc:date>2026-06-09T07:48:40Z</dc:date>
    </item>
    <item>
      <title>Lakeflow Connect: Managed Ingestion Without the Pipeline Tax</title>
      <link>https://community.databricks.com/t5/community-articles/lakeflow-connect-managed-ingestion-without-the-pipeline-tax/m-p/158619#M1283</link>
      <description>&lt;P&gt;I recently published a piece on Lakeflow Connect and wanted to share it here since this community is where the conversation actually happens.&lt;/P&gt;&lt;P&gt;The post covers something most of us have lived through: the hidden cost of maintaining ingestion pipelines. The Fivetran subscription, the S3 landing zone, the Airflow DAG, the custom CDC merge logic, and the monitoring stack add up to five vendors and 1,200 lines of code just to get data from Salesforce into Delta.&lt;/P&gt;&lt;P&gt;Lakeflow Connect collapses that into one declarative resource inside Databricks. I broke down:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;What changes architecturally when you migrate, including the before/after diff&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How log-based CDC and schema evolution are handled natively&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Where Lakeflow Connect fits, and where it doesn’t, since streaming with sub-second latency still belongs in Structured Streaming&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;What this means for data teams thinking about headcount and tool consolidation&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Full post on Medium:&lt;BR /&gt;&lt;A href="https://medium.com/@sporwal8989/lakeflow-connect-managed-ingestion-without-the-pipeline-tax-1d5fd74d516f" target="_blank"&gt;https://medium.com/@sporwal8989/lakeflow-connect-managed-ingestion-without-the-pipeline-tax-1d5fd74d516f&lt;/A&gt;&lt;/P&gt;&lt;P&gt;A few things I’d love to discuss with this community:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;For teams that have already migrated, what was the most painful part of the cutover?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The connector catalog is growing fast but isn’t universal yet. 
What sources do you wish were supported that aren’t?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;How are you handling the gap between Lakeflow Connect’s incremental ingestion and use cases that still need sub-second latency?&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Curious to hear what others have seen.&lt;/P&gt;&lt;P&gt;Thanks for reading.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Jun 2026 11:05:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/lakeflow-connect-managed-ingestion-without-the-pipeline-tax/m-p/158619#M1283</guid>
      <dc:creator>SHIVAMPORWAL</dc:creator>
      <dc:date>2026-06-09T11:05:43Z</dc:date>
    </item>
    <item>
      <title>Your data is clean. But who's accessing it, and how? Governing your Lakehouse with Unity Catalog</title>
      <link>https://community.databricks.com/t5/community-articles/your-data-is-clean-but-who-s-accessing-it-and-how-governing-your/m-p/159424#M1282</link>
      <description>&lt;P class=""&gt;Nobody told the analytics team they couldn't query the raw customer table. So, they did.&lt;/P&gt;&lt;P class=""&gt;Full names, email addresses, phone numbers - exported to a CSV for "a quick look." No alert fired. No one flagged it. We found out three weeks later during a compliance review.&lt;/P&gt;&lt;P class=""&gt;The pipeline was solid. Months of work. Clean transformations, reliable runs, well-structured tables. We just never thought about who could actually reach in and pull the data out.&lt;/P&gt;&lt;P class=""&gt;That's the part that gets skipped. You build a great pipeline and assume the work is done. But in retail - where you're sitting on customer PII, order history, and payment data - the pipeline is only half the job. The other half is knowing exactly who can see what, where data came from, and having an answer ready when compliance asks.&lt;/P&gt;&lt;P class=""&gt;"We trust our team" doesn't hold up in an audit.&lt;/P&gt;&lt;P class=""&gt;Unity Catalog is what fixed this for us. I've been running it in production for about a year, and the biggest change isn't a feature - it's that access control stops being something you bolt on after the pipelines are built and becomes part of how the platform works. This post covers the three things I use most: &lt;STRONG&gt;PII tagging&lt;/STRONG&gt;, &lt;STRONG&gt;lineage tracking&lt;/STRONG&gt;, and &lt;STRONG&gt;row-level security&lt;/STRONG&gt;. All with working SQL you can adapt directly.&lt;/P&gt;&lt;H2&gt;&lt;BR /&gt;A quick picture of what Unity Catalog does&lt;/H2&gt;&lt;P&gt;Unity Catalog introduces a three-level namespace on top of your existing Databricks workspace:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;catalog
  └── schema (database)
        └── table / view / volume&lt;/LI-CODE&gt;&lt;P&gt;So instead of &lt;FONT color="#0000FF"&gt;database.table&lt;/FONT&gt;, you now reference &lt;FONT color="#0000FF"&gt;catalog.schema.table&lt;/FONT&gt;&amp;nbsp;- something like &lt;FONT color="#0000FF"&gt;retail_prod.sales.orders&lt;/FONT&gt;. This might feel like extra typing at first, but it's what makes centralized governance possible - a single catalog with one permission model covering all your workspaces.&lt;/P&gt;&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;&lt;H2&gt;Tagging PII columns - know what you're carrying&lt;/H2&gt;&lt;P&gt;The first thing I did when setting up Unity Catalog was tag every column that carries personally identifiable information. Not because anyone asked me to - just because it's harder to lock down data you haven't mapped.&lt;/P&gt;&lt;P&gt;UC has a built-in tag system - catalog, schema, table, or column level. For PII I go straight to column level -&amp;nbsp;it's the most precise, and it gives you a queryable inventory of sensitive fields across your entire platform.&lt;/P&gt;&lt;P&gt;Here's how to create and assign a PII tag:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

-- Create a tag in your catalog
CREATE TAG IF NOT EXISTS pii_category  ALLOWED_VALUES 'name', 'email', 'phone', 'address', 'payment';

-- Apply tags to sensitive columns
ALTER TABLE retail_prod.customers.profiles ALTER COLUMN full_name SET TAGS ('pii_category' = 'name');

ALTER TABLE retail_prod.customers.profiles ALTER COLUMN email_address SET TAGS ('pii_category' = 'email');

ALTER TABLE retail_prod.customers.profiles ALTER COLUMN phone_number SET TAGS ('pii_category' = 'phone');

ALTER TABLE retail_prod.orders.transactions
  ALTER COLUMN card_last_four SET TAGS ('pii_category' = 'payment');&lt;/LI-CODE&gt;&lt;P&gt;Once tagged, this query gives you a full map of every sensitive column across your platform:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

SELECT
  catalog_name,
  schema_name,
  table_name,
  column_name,
  tag_value AS pii_type
FROM system.information_schema.column_tags
WHERE tag_name = 'pii_category'
ORDER BY catalog_name, schema_name, table_name;&lt;/LI-CODE&gt;&lt;P&gt;Run that and you have a full inventory of sensitive columns. The first time compliance asked us where customer email appeared across the platform, that query answered it in about ten seconds.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Worth being clear on though&lt;/STRONG&gt;: tags are metadata only - they don't restrict access by themselves. What they do is give you the foundation to build policies on top of, which is where the next two sections come in.&lt;/P&gt;&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;&lt;H2&gt;Lineage - where did this data come from?&lt;/H2&gt;&lt;P&gt;Every time a Delta Live Tables pipeline runs - or any notebook, job, or SQL query that reads from and writes to UC-managed tables - Unity Catalog automatically captures the data flow. No configuration required. You get column-level lineage out of the box.&lt;/P&gt;&lt;P&gt;The real pain shows up when an analyst reports the &lt;FONT color="#0000FF"&gt;lifetime_value&lt;/FONT&gt; column looks off. Without lineage you're manually tracing through notebooks and pipeline code trying to figure out what fed it. With lineage, you open Catalog Explorer, click the column, and see the exact chain - Silver table, Bronze table, raw source file. Done in seconds instead of an hour.&lt;/P&gt;&lt;P&gt;The same lineage is queryable directly if you need it in a script or dashboard:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

-- Find all upstream tables feeding into the Gold customer table
SELECT
  source_table_full_name,
  target_table_full_name,
  event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'retail_prod.gold.customer_lifetime_value'
ORDER BY event_time DESC;&lt;/LI-CODE&gt;&lt;P&gt;Lineage also works in reverse. If you're about to deprecate a Bronze table, you can check what downstream assets depend on it before you touch anything:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

-- What tables depend on our Bronze orders table?
SELECT
  source_table_full_name,
  target_table_full_name
FROM system.access.table_lineage
WHERE source_table_full_name = 'retail_prod.bronze.raw_orders'
ORDER BY target_table_full_name;&lt;/LI-CODE&gt;&lt;P&gt;No more "I deleted a table and three pipelines broke" incidents.&lt;/P&gt;&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;&lt;H2&gt;Fine-grained access control - who sees what&lt;/H2&gt;&lt;P&gt;&lt;STRONG&gt;Permissions in UC are just SQL&lt;/STRONG&gt;. You grant privileges at any level of the hierarchy - catalog, schema, or table - and they inherit downward:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

-- Analytics team gets read access to Gold only
GRANT USE CATALOG ON CATALOG retail_prod TO `analytics-team`;
GRANT USE SCHEMA ON SCHEMA retail_prod.gold TO `analytics-team`;
GRANT SELECT ON SCHEMA retail_prod.gold TO `analytics-team`;

-- Data engineers get full access to Bronze and Silver
GRANT ALL PRIVILEGES ON SCHEMA retail_prod.bronze TO `data-engineering`;
GRANT ALL PRIVILEGES ON SCHEMA retail_prod.silver TO `data-engineering`;

-- Remove any direct grant on the raw PII table (UC has no DENY: access exists only where granted)
REVOKE SELECT ON TABLE retail_prod.customers.profiles FROM `analytics-team`;&lt;/LI-CODE&gt;&lt;P&gt;The analytics team can query all Gold tables. They cannot touch raw Bronze data or the customers PII table directly. Clean separation, enforced at the platform level.&lt;/P&gt;&lt;H3&gt;Row-level security with dynamic views&lt;/H3&gt;&lt;P&gt;Table-level permissions are a blunt instrument. Sometimes you need more precision - regional managers should only see orders from their own region, or a customer support team should only see records for accounts they're assigned to.&lt;/P&gt;&lt;P&gt;The trick is &lt;FONT color="#0000FF"&gt;is_account_group_member()&lt;/FONT&gt;&amp;nbsp;- it checks the caller's group at query time, so the same view returns different rows for different people:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

CREATE OR REPLACE VIEW retail_prod.sales.regional_orders AS
SELECT
  order_id,
  order_date,
  customer_id,
  product_sku,
  order_amount,
  region
FROM retail_prod.silver.orders
WHERE
  CASE
    WHEN is_account_group_member('region-apac') THEN region = 'APAC'
    WHEN is_account_group_member('region-emea') THEN region = 'EMEA'
    WHEN is_account_group_member('region-us')   THEN region = 'US'
    WHEN is_account_group_member('data-engineering') THEN TRUE
    ELSE FALSE
  END;

-- Grant access to the view, not the underlying table
GRANT SELECT ON VIEW retail_prod.sales.regional_orders TO `analytics-team`;&lt;/LI-CODE&gt;&lt;P&gt;Same query, same view - different results based on who's asking. The underlying Silver table stays locked down.&lt;/P&gt;&lt;H3&gt;Column masking for PII&lt;/H3&gt;&lt;P&gt;One more pattern I use heavily: column masking. Instead of hiding an entire column, you partially mask it depending on who's querying. We use this for customer support - they need to verify who someone is, but there's no reason they should be able to export a raw email list.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql

-- Create a masking function (Unity Catalog column masks are SQL UDFs)
CREATE OR REPLACE FUNCTION retail_prod.security.email_mask(email STRING)
  RETURNS STRING
  RETURN
    CASE
      WHEN is_account_group_member('data-engineering') THEN email
      WHEN is_account_group_member('customer-support')
        -- SUBSTRING_INDEX returns the domain without the '@', so add it back
        THEN CONCAT(LEFT(email, 2), '****@', SUBSTRING_INDEX(email, '@', -1))
      ELSE '****@****.***'
    END;

-- Apply the mask to the column
ALTER TABLE retail_prod.customers.profiles
  ALTER COLUMN email_address SET MASK retail_prod.security.email_mask;&lt;/LI-CODE&gt;&lt;P&gt;A data engineer sees the full email. A customer support agent sees &lt;FONT color="#0000FF"&gt;jo****@gmail.com&lt;/FONT&gt;. Anyone else sees &lt;FONT color="#0000FF"&gt;****@****.***&lt;/FONT&gt;. One table, one masking function, three different experiences - and zero copies of the data floating around.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H2&gt;Auditing - who actually accessed what&lt;/H2&gt;&lt;P&gt;There's one more thing most people skip: checking whether any of this is actually being used. Unity Catalog logs every access event to the system audit tables:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;-- sql 

SELECT
  user_identity.email AS user_email,
  action_name,
  request_params.table_full_name AS table_accessed,
  event_time 
FROM system.access.audit 
WHERE request_params.table_full_name = 'retail_prod.customers.profiles'
  AND event_time &amp;gt;= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
  AND action_name IN ('SELECT', 'READ')
ORDER BY event_time DESC;&lt;/LI-CODE&gt;&lt;P&gt;Run this weekly, pipe it into a Databricks SQL dashboard, and you have an access audit trail that satisfies most compliance requirements without any custom logging infrastructure.&lt;/P&gt;&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;&lt;H2&gt;The mistakes I'd save you from&lt;/H2&gt;&lt;P&gt;&lt;STRONG&gt;Don't skip the catalog hierarchy design&lt;/STRONG&gt;. Once you have tables in Unity Catalog, restructuring the catalog/schema layout is painful. Spend an hour upfront deciding how you'll organize catalogs - by environment, by domain - before you start creating tables.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Tags alone don't protect data&lt;/STRONG&gt;. I've seen teams tag all their PII columns and then consider the job done. Tags are discovery and documentation - they don't enforce anything. Pair them with masking policies and view-based access control.&lt;/P&gt;&lt;P&gt;&lt;FONT color="#0000FF"&gt;is_account_group_member&lt;/FONT&gt; &lt;STRONG&gt;checks group membership at query time&lt;/STRONG&gt;. This is a feature, not a bug - add a user to a group and their access updates immediately without changing any views or policies. But it also means if you remove someone from a group, they lose access instantly. Make sure your group membership is managed carefully. I'd rather have a slightly annoying offboarding checklist than discover an ex-employee still has access to the customer table three months later.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Test your dynamic views as a non-admin&lt;/STRONG&gt;. It's easy to build a row-level security view, test it as an admin - who often bypasses restrictions - and ship it thinking it works. 
Always verify by impersonating a user in the target group.&lt;/P&gt;&lt;H2&gt;&amp;nbsp;&lt;/H2&gt;&lt;H2&gt;What this actually changes&lt;/H2&gt;&lt;P&gt;The part that surprised me most wasn't the technical setup - it was how much easier governance conversations became once everything lived in the platform rather than in a spreadsheet. People can't work around it accidentally. Access is granted explicitly, audited automatically, and revoked cleanly.&lt;/P&gt;&lt;P&gt;In retail, where you're handling customer PII and payment data, getting this wrong isn't just a technical problem. It shows up in audits, in compliance reviews, in the conversation nobody wants to have at 9am on a Monday after a data breach.&lt;/P&gt;&lt;P&gt;If I had to do this again from scratch, I'd do it in this order:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Tag PII columns first&lt;/STRONG&gt; - you need to know what you're protecting before you can protect it&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Get lineage working early&lt;/STRONG&gt; - it's much more useful for preventing incidents than for investigating them after the fact&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Build access control into views and masking policies&lt;/STRONG&gt;, not just table-level grants&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Check the audit logs regularly&lt;/STRONG&gt;, even when nothing seems wrong - that's usually when you find something useful&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you're setting up Unity Catalog for the first time or migrating from the legacy Hive metastore, start small. Pick one domain, one catalog, get the permissions right, then expand. Trying to govern everything at once leads to over-complicated structures that nobody maintains.&lt;/P&gt;&lt;P&gt;Drop a comment if you've hit any Unity Catalog edge cases in production - particularly around external tables or cross-workspace sharing. Always curious what others have run into.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jun 2026 10:14:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/your-data-is-clean-but-who-s-accessing-it-and-how-governing-your/m-p/159424#M1282</guid>
      <dc:creator>savlahanish27</dc:creator>
      <dc:date>2026-06-17T10:14:12Z</dc:date>
    </item>
    <item>
      <title>Building Production-Ready SDP Pipelines with Genie Code: The Complete Guide</title>
      <link>https://community.databricks.com/t5/community-articles/building-production-ready-sdp-pipelines-with-genie-code-the/m-p/159195#M1281</link>
      <description>&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;EM class="nb"&gt;How Databricks’ AI agent transforms data engineering from manual craftsmanship into conversational pipeline development&lt;/EM&gt;&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Data engineers have long accepted a painful truth: building production-grade ETL pipelines means wrestling with hundreds of lines of orchestration code, manually encoding execution order, handling incremental processing logic, and then praying nothing breaks at 2 AM. Spark Declarative Pipelines (SDP) already simplified this dramatically by letting you declare&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM class="nb"&gt;what&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;your data should look like rather than&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM class="nb"&gt;how&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to get there. Now, with Genie Code in Agent mode, you don’t even have to write those declarations yourself.&lt;/P&gt;
&lt;FIGURE class="nn no np nq nr ns nk nl paragraph-image"&gt;
&lt;DIV class="nt nu ek nv bd nw" tabindex="0" role="button"&gt;&lt;SPAN class="eo ep eq ai er es et eu ev speechify-ignore"&gt;Press enter or click to view image in full size&lt;/SPAN&gt;
&lt;DIV class="nk nl nm"&gt;&lt;PICTURE&gt;&lt;SOURCE srcset="https://miro.medium.com/v2/resize:fit:640/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 640w, https://miro.medium.com/v2/resize:fit:720/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 720w, https://miro.medium.com/v2/resize:fit:750/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 750w, https://miro.medium.com/v2/resize:fit:786/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 786w, https://miro.medium.com/v2/resize:fit:828/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 828w, https://miro.medium.com/v2/resize:fit:1100/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 1100w, https://miro.medium.com/v2/resize:fit:1400/format:webp/1*1VzZtaCw0WyY8glVdfkstA.png 1400w" type="image/webp" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px"&gt;&lt;/SOURCE&gt;&lt;SOURCE srcset="https://miro.medium.com/v2/resize:fit:640/1*1VzZtaCw0WyY8glVdfkstA.png 640w, https://miro.medium.com/v2/resize:fit:720/1*1VzZtaCw0WyY8glVdfkstA.png 720w, https://miro.medium.com/v2/resize:fit:750/1*1VzZtaCw0WyY8glVdfkstA.png 750w, https://miro.medium.com/v2/resize:fit:786/1*1VzZtaCw0WyY8glVdfkstA.png 786w, https://miro.medium.com/v2/resize:fit:828/1*1VzZtaCw0WyY8glVdfkstA.png 828w, https://miro.medium.com/v2/resize:fit:1100/1*1VzZtaCw0WyY8glVdfkstA.png 1100w, https://miro.medium.com/v2/resize:fit:1400/1*1VzZtaCw0WyY8glVdfkstA.png 1400w" sizes="(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and 
(max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px" data-testid="og"&gt;&lt;/SOURCE&gt;&lt;/PICTURE&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="shwetav1407_0-1781633757203.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27852iC312D69CBA3164EB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="shwetav1407_0-1781633757203.png" alt="shwetav1407_0-1781633757203.png" /&gt;&lt;/span&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/FIGURE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;In this guide, we’ll walk through building a complete medallion architecture pipeline using Genie Code and SDP — from raw ingestion through business-ready analytics — and explore the patterns that make this approach production-worthy.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="v cf nc nd ne nf" role="separator"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="f094" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;What Is SDP, and Why Should You Care?&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Lakeflow Spark Declarative Pipelines (SDP)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is Databricks’ framework for building batch and streaming data pipelines in SQL and Python. Unlike traditional Spark jobs where you manually define execution order, manage checkpoints, and handle retries, SDP lets you declare your transformations and handles the orchestration automatically.&lt;/P&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The key benefits that matter for real-world pipelines:&lt;/P&gt;
&lt;UL class=""&gt;
&lt;LI id="cd24" class="md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Automatic orchestration&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— SDP analyzes dependencies across all your source files, builds a dataflow graph, and determines the optimal execution order with maximum parallelism. It also retries failures at the most granular level possible: first the Spark task, then the flow, then the pipeline.&lt;/LI&gt;
&lt;LI id="808f" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Incremental processing built in&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— Materialized views automatically process only new data and changes. No more writing&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;MERGE&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;statements by hand.&lt;/LI&gt;
&lt;LI id="f435" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Data quality as code&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— Expectations let you define quality constraints inline, right next to your transformations.&lt;/LI&gt;
&lt;LI id="16f3" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Unified batch and streaming&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— Toggle between batch and streaming processing modes with a single keyword change.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Here’s what that looks like compared to traditional approaches:&lt;/P&gt;
&lt;H3 id="ed67" class="po oa gn bb ob pp pq pr of ps pt pu oj mo pv pw px ms py pz qa mw qb qc qd qe bg" data-selectable-paragraph=""&gt;The Old Way (PySpark + Manual Orchestration)&lt;/H3&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;# Hundreds of lines for a simple weekly sales pipeline&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; SparkSession&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; pyspark.sql.functions &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; col, &lt;SPAN class="hljs-built_in"&gt;sum&lt;/SPAN&gt;, window&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; delta.tables &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; DeltaTable&lt;BR /&gt;&lt;BR /&gt;spark = SparkSession.builder.getOrCreate()&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# Step 1: Read raw data (manually handle incremental)&lt;/SPAN&gt;&lt;BR /&gt;raw_df = spark.read.&lt;SPAN class="hljs-built_in"&gt;format&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"delta"&lt;/SPAN&gt;).load(&lt;SPAN class="hljs-string"&gt;"/data/raw_sales"&lt;/SPAN&gt;)&lt;BR /&gt;last_processed = spark.read.&lt;SPAN class="hljs-built_in"&gt;format&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"delta"&lt;/SPAN&gt;) \&lt;BR /&gt;    .load(&lt;SPAN class="hljs-string"&gt;"/checkpoints/last_ts"&lt;/SPAN&gt;).collect()[&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;][&lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;]&lt;BR /&gt;new_data = raw_df.&lt;SPAN class="hljs-built_in"&gt;filter&lt;/SPAN&gt;(col(&lt;SPAN class="hljs-string"&gt;"event_time"&lt;/SPAN&gt;) &amp;gt; last_processed)&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# Step 2: Clean (manually write quality checks)&lt;/SPAN&gt;&lt;BR /&gt;cleaned = new_data.&lt;SPAN class="hljs-built_in"&gt;filter&lt;/SPAN&gt;(&lt;BR /&gt;    col(&lt;SPAN class="hljs-string"&gt;"amount"&lt;/SPAN&gt;).isNotNull() &amp;amp; &lt;BR /&gt;    (col(&lt;SPAN 
class="hljs-string"&gt;"amount"&lt;/SPAN&gt;) &amp;gt; &lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;)&lt;BR /&gt;)&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# Step 3: Aggregate (manually handle upserts)&lt;/SPAN&gt;&lt;BR /&gt;weekly = cleaned.groupBy(&lt;BR /&gt;    window(&lt;SPAN class="hljs-string"&gt;"event_time"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"1 week"&lt;/SPAN&gt;), &lt;SPAN class="hljs-string"&gt;"region"&lt;/SPAN&gt;&lt;BR /&gt;).agg(&lt;SPAN class="hljs-built_in"&gt;sum&lt;/SPAN&gt;(&lt;SPAN class="hljs-string"&gt;"amount"&lt;/SPAN&gt;).alias(&lt;SPAN class="hljs-string"&gt;"total_sales"&lt;/SPAN&gt;))&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# Step 4: Write (manually handle merge)&lt;/SPAN&gt;&lt;BR /&gt;target = DeltaTable.forPath(spark, &lt;SPAN class="hljs-string"&gt;"/data/weekly_sales"&lt;/SPAN&gt;)&lt;BR /&gt;target.alias(&lt;SPAN class="hljs-string"&gt;"t"&lt;/SPAN&gt;).merge(&lt;BR /&gt;    weekly.alias(&lt;SPAN class="hljs-string"&gt;"s"&lt;/SPAN&gt;),&lt;BR /&gt;    &lt;SPAN class="hljs-string"&gt;"t.window = s.window AND t.region = s.region"&lt;/SPAN&gt;&lt;BR /&gt;).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# Step 5: Update checkpoint (manually track state)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# ... plus an Airflow DAG for scheduling, retries, alerting&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;H3 id="6aef" class="po oa gn bb ob pp pq pr of ps pt pu oj mo pv pw px ms py pz qa mw qb qc qd qe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="ah"&gt;The SDP Way (SQL)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;-- The entire pipeline in a few declarations&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;-- Bronze: raw ingestion with Auto Loader&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; bronze_sales&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM read_files(&lt;BR /&gt;  &lt;SPAN class="hljs-string"&gt;'/data/landing/sales/'&lt;/SPAN&gt;,&lt;BR /&gt;  format &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'json'&lt;/SPAN&gt;,&lt;BR /&gt;  schema &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'event_time TIMESTAMP, region STRING, &lt;BR /&gt;             product STRING, amount DOUBLE'&lt;/SPAN&gt;&lt;BR /&gt;);&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;-- Silver: cleansed with quality expectations&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; silver_sales (&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_amount EXPECT (amount &lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;,&lt;BR /&gt;  &lt;SPAN 
class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; not_null_region EXPECT (region &lt;SPAN class="hljs-keyword"&gt;IS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NOT&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NULL&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;&lt;BR /&gt;)&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  event_time,&lt;BR /&gt;  region,&lt;BR /&gt;  product,&lt;BR /&gt;  amount,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;current_timestamp&lt;/SPAN&gt;() &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; processed_at&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM(bronze_sales);&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;-- Gold: business-ready weekly aggregation&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH MATERIALIZED &lt;SPAN class="hljs-keyword"&gt;VIEW&lt;/SPAN&gt; gold_weekly_sales&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  date_trunc(&lt;SPAN class="hljs-string"&gt;'week'&lt;/SPAN&gt;, event_time) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; week_start,&lt;BR /&gt;  region,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; transaction_count,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;SUM&lt;/SPAN&gt;(amount) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; total_sales,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;AVG&lt;/SPAN&gt;(amount) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; avg_transaction&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; silver_sales&lt;BR /&gt;&lt;SPAN 
class="hljs-keyword"&gt;GROUP&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; date_trunc(&lt;SPAN class="hljs-string"&gt;'week'&lt;/SPAN&gt;, event_time), region;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;That’s it. SDP handles incremental processing, execution order, retries, and checkpoint management. The bronze and silver tables use streaming semantics (the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;STREAM&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;keyword), while the gold materialized view uses batch semantics but still only reprocesses changed data.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="v cf nc nd ne nf" role="separator"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="1ce0" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Enter Genie Code: Your AI Data Engineering Partner&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Now here’s where it gets interesting. Genie Code in Agent mode — available inside the Lakeflow Pipelines Editor — doesn’t just help you&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM class="nb"&gt;write&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;SDP code. It can autonomously&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG class="mf go"&gt;plan, generate, run, validate, and fix&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;entire pipelines from a single natural language prompt.&lt;/P&gt;
&lt;H2 id="6ebd" class="nz oa gn bb ob oc qn oe of og qo oi oj ok qp om on oo qq oq or os qr ou ov ow bg" data-selectable-paragraph=""&gt;How Genie Code Agent Mode Works&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;When you enable Agent mode in the Genie Code panel within the Lakeflow Pipelines Editor, the agent adapts its capabilities specifically for data engineering tasks. Unlike chat mode, Agent mode can:&lt;/P&gt;
&lt;OL class=""&gt;
&lt;LI id="809b" class="md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na qs pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Plan a multi-step solution&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and present it for your review&lt;/LI&gt;
&lt;LI id="1e1e" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Search your Unity Catalog&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;for relevant tables, schemas, and lineage&lt;/LI&gt;
&lt;LI id="a111" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Generate SQL or Python SDP source files&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in the pipeline editor&lt;/LI&gt;
&lt;LI id="28b7" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Run pipeline updates&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and read the output datasets&lt;/LI&gt;
&lt;LI id="c7a6" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Diagnose and fix errors&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;automatically, iterating until the pipeline succeeds&lt;/LI&gt;
&lt;LI id="fb45" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Respect your Unity Catalog permissions&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— it can only access data you can access&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The key design principle is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM class="nb"&gt;human-in-the-loop&lt;/EM&gt;: Genie Code proposes plans and asks for approval before executing. You can Allow, Decline, or ask it to try a different approach.&lt;/P&gt;
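&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;One way to picture the approval flow is as a gate in front of every proposed step. The sketch below is purely illustrative and is not how Genie Code is implemented; &lt;CODE class="db pk pl pm pn b"&gt;ProposedStep&lt;/CODE&gt;, &lt;CODE class="db pk pl pm pn b"&gt;run_with_approval&lt;/CODE&gt;, and the &lt;CODE class="db pk pl pm pn b"&gt;decide&lt;/CODE&gt; callback are hypothetical names.&lt;/P&gt;

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProposedStep:
    """One step of an agent's plan (hypothetical structure)."""
    description: str
    action: Callable[[], str]


def run_with_approval(
    steps: List[ProposedStep],
    decide: Callable[[ProposedStep], str],
) -> List[str]:
    """Run each step only if the reviewer answers "allow".

    Execution stops at the first step that is not allowed, so nothing
    runs without an explicit approval.
    """
    results = []
    for step in steps:
        if decide(step) != "allow":
            results.append(f"stopped before: {step.description}")
            break
        results.append(step.action())
    return results


if __name__ == "__main__":
    plan = [
        ProposedStep("create bronze_orders.sql", lambda: "bronze created"),
        ProposedStep("run the pipeline", lambda: "pipeline ran"),
    ]
    # The reviewer allows file creation but declines the pipeline run.
    reviewer = lambda step: "allow" if "create" in step.description else "decline"
    print(run_with_approval(plan, reviewer))
```

&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The point of the design is that the default is to stop: a step runs only after an explicit approval.&lt;/P&gt;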
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="v cf nc nd ne nf" role="separator"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="5d29" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Tutorial: Building a Medallion Pipeline with Genie Code&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Let’s walk through building a real pipeline: an e-commerce analytics pipeline that ingests order data, cleans and enriches it, and produces dashboard-ready metrics.&lt;/P&gt;
&lt;H2 id="61c2" class="nz oa gn bb ob oc qn oe of og qo oi oj ok qp om on oo qq oq or os qr ou ov ow bg" data-selectable-paragraph=""&gt;Prerequisites&lt;/H2&gt;
&lt;UL class=""&gt;
&lt;LI id="8c94" class="md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na pc pd pe bg" data-selectable-paragraph=""&gt;A Databricks workspace with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG class="mf go"&gt;Partner-powered AI features&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;enabled&lt;/LI&gt;
&lt;LI id="cf6c" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;Access to the Lakeflow Pipelines Editor&lt;/LI&gt;
&lt;LI id="ff20" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;Unity Catalog configured with a target catalog and schema&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 id="03a6" class="nz oa gn bb ob oc qn oe of og qo oi oj ok qp om on oo qq oq or os qr ou ov ow bg" data-selectable-paragraph=""&gt;Step 1: Create Your Pipeline and Open Genie Code&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Navigate to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG class="mf go"&gt;Pipelines&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in the sidebar and create a new pipeline. Give it a name like&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;ecommerce_analytics&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and set your target catalog and schema (e.g.,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;analytics.ecommerce&lt;/CODE&gt;).&lt;/P&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Once in the Lakeflow Pipelines Editor, open the Genie Code panel and switch to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG class="mf go"&gt;Agent mode&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2 id="92ca" class="nz oa gn bb ob oc qn oe of og qo oi oj ok qp om on oo qq oq or os qr ou ov ow bg" data-selectable-paragraph=""&gt;Step 2: Prompt Genie Code to Build the Pipeline&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Start with a descriptive prompt that tells Genie Code what you want:&lt;/P&gt;
&lt;BLOCKQUOTE class="qt qu qv"&gt;
&lt;P class="md me nb mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;&lt;EM class="gn"&gt;Your prompt:&lt;/EM&gt;&lt;/STRONG&gt;&lt;EM class="gn"&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;“Build a medallion architecture pipeline for e-commerce analytics. I have raw order data landing as JSON files in /Volumes/raw_data/orders/ with fields: order_id, customer_id, product_id, quantity, unit_price, order_timestamp, and shipping_region. Create bronze ingestion with Auto Loader, silver cleansing with quality expectations, and gold aggregations for daily revenue by region and top products.”&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Genie Code will create a step-by-step plan that looks something like:&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;Plan:&lt;BR /&gt;&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;Search&lt;/SPAN&gt; Unity Catalog &lt;SPAN class="hljs-keyword"&gt;for&lt;/SPAN&gt; existing related tables&lt;BR /&gt;&lt;SPAN class="hljs-number"&gt;2.&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;Create&lt;/SPAN&gt; bronze_orders.sql — streaming &lt;SPAN class="hljs-keyword"&gt;table&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;with&lt;/SPAN&gt; Auto Loader&lt;BR /&gt;&lt;SPAN class="hljs-number"&gt;3.&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;Create&lt;/SPAN&gt; silver_orders.sql — cleaned data &lt;SPAN class="hljs-keyword"&gt;with&lt;/SPAN&gt; expectations&lt;BR /&gt;&lt;SPAN class="hljs-number"&gt;4.&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;Create&lt;/SPAN&gt; gold_daily_revenue.sql — daily revenue &lt;SPAN class="hljs-keyword"&gt;by&lt;/SPAN&gt; region&lt;BR /&gt;&lt;SPAN class="hljs-number"&gt;5.&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;Create&lt;/SPAN&gt; gold_top_products.sql — top products materialized &lt;SPAN class="hljs-keyword"&gt;view&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-number"&gt;6.&lt;/SPAN&gt; Run the pipeline &lt;SPAN class="hljs-keyword"&gt;and&lt;/SPAN&gt; validate outputs&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Review the plan, ask clarifying questions if needed, then select&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG class="mf go"&gt;Allow&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to let Genie Code proceed.&lt;/P&gt;
&lt;H2 id="214e" class="nz oa gn bb ob oc qn oe of og qo oi oj ok qp om on oo qq oq or os qr ou ov ow bg" data-selectable-paragraph=""&gt;Step 3: Watch Genie Code Generate Your Pipeline&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Genie Code creates each source file in your pipeline. Here’s what the generated code typically looks like:&lt;/P&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;File:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;bronze_orders.sql&lt;/CODE&gt;&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;-- Bronze layer: raw ingestion from JSON landing zone&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; bronze_orders&lt;BR /&gt;COMMENT &lt;SPAN class="hljs-string"&gt;'Raw e-commerce orders ingested via Auto Loader'&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt;,&lt;BR /&gt;  _metadata.file_name &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; source_file,&lt;BR /&gt;  _metadata.file_modification_time &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; file_mod_time,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;current_timestamp&lt;/SPAN&gt;() &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; ingestion_timestamp&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM read_files(&lt;BR /&gt;  &lt;SPAN class="hljs-string"&gt;'/Volumes/raw_data/orders/'&lt;/SPAN&gt;,&lt;BR /&gt;  format &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'json'&lt;/SPAN&gt;,&lt;BR /&gt;  inferColumnTypes &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'true'&lt;/SPAN&gt;&lt;BR /&gt;);&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;File:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;silver_orders.sql&lt;/CODE&gt;&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;-- Silver layer: cleansed and validated orders&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; silver_orders (&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_order_id &lt;BR /&gt;    EXPECT (order_id &lt;SPAN class="hljs-keyword"&gt;IS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NOT&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NULL&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;,&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_quantity &lt;BR /&gt;    EXPECT (quantity &lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;AND&lt;/SPAN&gt; quantity &lt;SPAN class="hljs-operator"&gt;&amp;lt;&lt;/SPAN&gt; &lt;SPAN class="hljs-number"&gt;10000&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;,&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_price &lt;BR /&gt;    EXPECT (unit_price &lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;,&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_timestamp &lt;BR /&gt;    EXPECT (order_timestamp &lt;SPAN class="hljs-keyword"&gt;IS&lt;/SPAN&gt; &lt;SPAN 
class="hljs-keyword"&gt;NOT&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NULL&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;,&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_region &lt;BR /&gt;    EXPECT (shipping_region &lt;SPAN class="hljs-keyword"&gt;IS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NOT&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NULL&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION FAIL &lt;SPAN class="hljs-keyword"&gt;UPDATE&lt;/SPAN&gt;&lt;BR /&gt;)&lt;BR /&gt;COMMENT &lt;SPAN class="hljs-string"&gt;'Cleansed orders with quality expectations enforced'&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  order_id,&lt;BR /&gt;  customer_id,&lt;BR /&gt;  product_id,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;CAST&lt;/SPAN&gt;(quantity &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;INT&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; quantity,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;CAST&lt;/SPAN&gt;(unit_price &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;DOUBLE&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; unit_price,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;CAST&lt;/SPAN&gt;(quantity &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;INT&lt;/SPAN&gt;) &lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt; &lt;SPAN class="hljs-built_in"&gt;CAST&lt;/SPAN&gt;(unit_price &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;DOUBLE&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; line_total,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;CAST&lt;/SPAN&gt;(order_timestamp &lt;SPAN 
class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;TIMESTAMP&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; order_timestamp,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;UPPER&lt;/SPAN&gt;(&lt;SPAN class="hljs-built_in"&gt;TRIM&lt;/SPAN&gt;(shipping_region)) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; shipping_region,&lt;BR /&gt;  ingestion_timestamp&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM(bronze_orders);&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;File:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;gold_daily_revenue.sql&lt;/CODE&gt;&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;-- Gold layer: daily revenue metrics by region&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH MATERIALIZED &lt;SPAN class="hljs-keyword"&gt;VIEW&lt;/SPAN&gt; gold_daily_revenue&lt;BR /&gt;COMMENT &lt;SPAN class="hljs-string"&gt;'Daily revenue aggregation by shipping region'&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-type"&gt;DATE&lt;/SPAN&gt;(order_timestamp) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; order_date,&lt;BR /&gt;  shipping_region,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class="hljs-keyword"&gt;DISTINCT&lt;/SPAN&gt; order_id) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; total_orders,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class="hljs-keyword"&gt;DISTINCT&lt;/SPAN&gt; customer_id) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; unique_customers,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;SUM&lt;/SPAN&gt;(line_total) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; total_revenue,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;AVG&lt;/SPAN&gt;(line_total) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; avg_order_value,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;SUM&lt;/SPAN&gt;(quantity) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; total_units_sold&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; silver_orders&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;GROUP&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;DATE&lt;/SPAN&gt;(order_timestamp), shipping_region;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;File:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;gold_top_products.sql&lt;/CODE&gt;&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;-- Gold layer: top products by revenue&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH MATERIALIZED &lt;SPAN class="hljs-keyword"&gt;VIEW&lt;/SPAN&gt; gold_top_products&lt;BR /&gt;COMMENT &lt;SPAN class="hljs-string"&gt;'Product performance ranked by total revenue'&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  product_id,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;COUNT&lt;/SPAN&gt;(&lt;SPAN class="hljs-keyword"&gt;DISTINCT&lt;/SPAN&gt; order_id) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; times_ordered,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;SUM&lt;/SPAN&gt;(quantity) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; total_units,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;SUM&lt;/SPAN&gt;(line_total) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; total_revenue,&lt;BR /&gt;  &lt;SPAN class="hljs-built_in"&gt;AVG&lt;/SPAN&gt;(unit_price) &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; avg_price&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; silver_orders&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;GROUP&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; product_id;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;H2 id="26fd" class="nz oa gn bb ob oc qn oe of og qo oi oj ok qp om on oo qq oq or os qr ou ov ow bg" data-selectable-paragraph=""&gt;Step 4: Genie Code Runs and Validates&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;After generating the files, Genie Code asks for permission to run the pipeline. Once you approve, it:&lt;/P&gt;
&lt;OL class=""&gt;
&lt;LI id="79c5" class="md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na qs pd pe bg" data-selectable-paragraph=""&gt;Triggers a pipeline update&lt;/LI&gt;
&lt;LI id="40e7" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;Monitors execution across all flows&lt;/LI&gt;
&lt;LI id="b3f7" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;Reads the output datasets to verify data landed correctly&lt;/LI&gt;
&lt;LI id="ae40" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na qs pd pe bg" data-selectable-paragraph=""&gt;Reports back with row counts, any expectation violations, and the DAG structure&lt;/LI&gt;
&lt;/OL&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;If something fails — say a schema mismatch in the JSON files — Genie Code diagnoses the error, proposes a fix (like adjusting the schema inference or adding a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;CAST&lt;/CODE&gt;), and iterates until the pipeline succeeds.&lt;/P&gt;
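&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The run, diagnose, and fix cycle amounts to a bounded retry loop. Here is a minimal sketch of that idea, assuming hypothetical &lt;CODE class="db pk pl pm pn b"&gt;run_pipeline&lt;/CODE&gt; and &lt;CODE class="db pk pl pm pn b"&gt;propose_fix&lt;/CODE&gt; callables and an attempt cap of my own choosing (the article does not say how many iterations Genie Code makes).&lt;/P&gt;

```python
from typing import Callable, Optional, Tuple


def iterate_until_success(
    run_pipeline: Callable[[dict], Optional[str]],
    propose_fix: Callable[[dict, str], dict],
    config: dict,
    max_attempts: int = 3,
) -> Tuple[bool, dict, int]:
    """Run the pipeline and apply a proposed fix after each failure.

    run_pipeline returns None on success or an error message on failure.
    Returns (succeeded, final config, number of runs performed).
    """
    for attempt in range(1, max_attempts + 1):
        error = run_pipeline(config)
        if error is None:
            return True, config, attempt
        # Feed the error back into the fixer and try again.
        config = propose_fix(config, error)
    return False, config, max_attempts


def fake_run(config: dict) -> Optional[str]:
    """Stand-in for a pipeline update that fails until quantity is cast."""
    if config.get("cast_quantity"):
        return None
    return "schema mismatch: quantity is STRING"


def fake_fix(config: dict, error: str) -> dict:
    """Stand-in for a diagnosis step that adds a CAST when quantity is the problem."""
    if "quantity" in error:
        return {**config, "cast_quantity": True}
    return config


if __name__ == "__main__":
    print(iterate_until_success(fake_run, fake_fix, {}))
```

&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Capping the attempts matters in practice: a fixer that cannot resolve the error should hand control back to you rather than loop indefinitely.&lt;/P&gt;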
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="3e95" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Going Deeper: Python SDP with Genie Code&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;While SQL is the most common approach, SDP also supports Python for more complex transformation logic. The Python API uses decorators from the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;pyspark.pipelines&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;module (imported as&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;dp&lt;/CODE&gt;).&lt;/P&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Here’s what a Python-based silver layer might look like when you need custom transformation logic:&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; pyspark &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; pipelines &lt;SPAN class="hljs-keyword"&gt;as&lt;/SPAN&gt; dp&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;from&lt;/SPAN&gt; pyspark.sql.functions &lt;SPAN class="hljs-keyword"&gt;import&lt;/SPAN&gt; col, upper, trim, when, lit&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;# expect_or_drop removes rows that fail the condition; plain expect only records them&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-meta"&gt;@dp.table(&lt;SPAN class="hljs-params"&gt;&lt;BR /&gt;    name=&lt;SPAN class="hljs-string"&gt;"silver_orders_enriched"&lt;/SPAN&gt;,&lt;BR /&gt;    comment=&lt;SPAN class="hljs-string"&gt;"Orders enriched with derived customer segments"&lt;/SPAN&gt;&lt;BR /&gt;&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-meta"&gt;@dp.expect_or_drop(&lt;SPAN class="hljs-params"&gt;&lt;SPAN class="hljs-string"&gt;"valid_order_id"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"order_id IS NOT NULL"&lt;/SPAN&gt;&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-meta"&gt;@dp.expect_or_drop(&lt;SPAN class="hljs-params"&gt;&lt;SPAN class="hljs-string"&gt;"valid_amount"&lt;/SPAN&gt;, &lt;SPAN class="hljs-string"&gt;"line_total &amp;gt; 0"&lt;/SPAN&gt;&lt;/SPAN&gt;)&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;def&lt;/SPAN&gt; &lt;SPAN class="hljs-title.function"&gt;silver_orders_enriched&lt;/SPAN&gt;():&lt;BR /&gt;    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; (&lt;BR /&gt;        spark.readStream.table(&lt;SPAN class="hljs-string"&gt;"bronze_orders"&lt;/SPAN&gt;)&lt;BR /&gt;        .withColumn(&lt;SPAN class="hljs-string"&gt;"line_total"&lt;/SPAN&gt;, col(&lt;SPAN class="hljs-string"&gt;"quantity"&lt;/SPAN&gt;) * col(&lt;SPAN class="hljs-string"&gt;"unit_price"&lt;/SPAN&gt;))&lt;BR /&gt;        
.withColumn(&lt;SPAN class="hljs-string"&gt;"shipping_region"&lt;/SPAN&gt;, upper(trim(col(&lt;SPAN class="hljs-string"&gt;"shipping_region"&lt;/SPAN&gt;))))&lt;BR /&gt;        .withColumn(&lt;BR /&gt;            &lt;SPAN class="hljs-string"&gt;"customer_segment"&lt;/SPAN&gt;,&lt;BR /&gt;            when(col(&lt;SPAN class="hljs-string"&gt;"line_total"&lt;/SPAN&gt;) &amp;gt;= &lt;SPAN class="hljs-number"&gt;500&lt;/SPAN&gt;, lit(&lt;SPAN class="hljs-string"&gt;"premium"&lt;/SPAN&gt;))&lt;BR /&gt;            .when(col(&lt;SPAN class="hljs-string"&gt;"line_total"&lt;/SPAN&gt;) &amp;gt;= &lt;SPAN class="hljs-number"&gt;100&lt;/SPAN&gt;, lit(&lt;SPAN class="hljs-string"&gt;"standard"&lt;/SPAN&gt;))&lt;BR /&gt;            .otherwise(lit(&lt;SPAN class="hljs-string"&gt;"basic"&lt;/SPAN&gt;))&lt;BR /&gt;        )&lt;BR /&gt;        .withColumn(&lt;SPAN class="hljs-string"&gt;"order_date"&lt;/SPAN&gt;, col(&lt;SPAN class="hljs-string"&gt;"order_timestamp"&lt;/SPAN&gt;).cast(&lt;SPAN class="hljs-string"&gt;"date"&lt;/SPAN&gt;))&lt;BR /&gt;    )&lt;/SPAN&gt;&lt;/PRE&gt;
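&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The segment thresholds are easy to check without a Spark session by mirroring the &lt;CODE class="db pk pl pm pn b"&gt;when&lt;/CODE&gt;/&lt;CODE class="db pk pl pm pn b"&gt;otherwise&lt;/CODE&gt; chain in plain Python. The function name &lt;CODE class="db pk pl pm pn b"&gt;customer_segment&lt;/CODE&gt; is hypothetical; the 500 and 100 cut-offs are the ones used in the pipeline code above.&lt;/P&gt;

```python
from typing import Optional


def customer_segment(quantity: Optional[int], unit_price: Optional[float]) -> str:
    """Plain-Python mirror of the when/otherwise chain.

    premium at a line total of 500 or more, standard at 100 or more,
    basic otherwise.
    """
    # In Spark a comparison against NULL is not true, so the chain falls
    # through to otherwise(); None is treated the same way here.
    if quantity is None or unit_price is None:
        return "basic"
    line_total = quantity * unit_price
    if line_total >= 500:
        return "premium"
    if line_total >= 100:
        return "standard"
    return "basic"


if __name__ == "__main__":
    for qty, price in [(5, 100.0), (1, 100.0), (1, 99.99)]:
        print(qty, price, customer_segment(qty, price))
```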
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;You can ask Genie Code specifically for Python implementations:&lt;/P&gt;
&lt;BLOCKQUOTE class="qt qu qv"&gt;
&lt;P class="md me nb mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;&lt;EM class="gn"&gt;Your prompt:&lt;/EM&gt;&lt;/STRONG&gt;&lt;EM class="gn"&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;“Add a Python-based silver transformation that enriches orders with a customer loyalty tier based on historical order count from the customers table in analytics.core.”&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Genie Code will search your Unity Catalog for the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;customers&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;table, understand its schema, and generate a Python file that joins and enriches appropriately.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="v cf nc nd ne nf" role="separator"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="97be" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Handling Change Data Capture (CDC)&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;One of SDP’s most powerful features is&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;AUTO CDC&lt;/CODE&gt;, which handles change data capture, including out-of-order events, by ordering changes on a sequencing column you name. Hand-written pipelines have to implement that ordering and merge logic themselves; in SDP it is a single declaration.&lt;/P&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;SQL example for CDC with SCD Type 2:&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;-- Streaming table to capture raw CDC events&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; customers_cdc_raw&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM read_files(&lt;BR /&gt;  &lt;SPAN class="hljs-string"&gt;'/Volumes/raw_data/customers_cdc/'&lt;/SPAN&gt;,&lt;BR /&gt;  format &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'json'&lt;/SPAN&gt;&lt;BR /&gt;);&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;-- Cleansed CDC with expectations&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; customers_cdc_clean (&lt;BR /&gt;  &lt;SPAN class="hljs-keyword"&gt;CONSTRAINT&lt;/SPAN&gt; valid_id EXPECT (customer_id &lt;SPAN class="hljs-keyword"&gt;IS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NOT&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;NULL&lt;/SPAN&gt;) &lt;SPAN class="hljs-keyword"&gt;ON&lt;/SPAN&gt; VIOLATION &lt;SPAN class="hljs-keyword"&gt;DROP&lt;/SPAN&gt; &lt;SPAN class="hljs-type"&gt;ROW&lt;/SPAN&gt;&lt;BR /&gt;)&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt;&lt;BR /&gt;  customer_id,&lt;BR /&gt;  name,&lt;BR /&gt;  email,&lt;BR /&gt;  address,&lt;BR /&gt;  operation,&lt;BR /&gt;  operation_timestamp&lt;BR /&gt;&lt;SPAN 
class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM(customers_cdc_raw);&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;-- Apply CDC changes with SCD Type 2 history tracking&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; customers;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-comment"&gt;-- AUTO CDC is declared as a flow into the target; the flow name here is illustrative&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; FLOW customers_scd2_flow &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; AUTO CDC &lt;SPAN class="hljs-keyword"&gt;INTO&lt;/SPAN&gt; customers&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM(customers_cdc_clean)&lt;BR /&gt;KEYS (customer_id)&lt;BR /&gt;SEQUENCE &lt;SPAN class="hljs-keyword"&gt;BY&lt;/SPAN&gt; operation_timestamp&lt;BR /&gt;STORED &lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; SCD TYPE &lt;SPAN class="hljs-number"&gt;2&lt;/SPAN&gt;;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;You can prompt Genie Code with something like:&lt;/P&gt;
&lt;BLOCKQUOTE class="qt qu qv"&gt;
&lt;P class="md me nb mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;&lt;EM class="gn"&gt;Your prompt:&lt;/EM&gt;&lt;/STRONG&gt;&lt;EM class="gn"&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;“Add change data capture for customer updates from Debezium CDC events. I need SCD Type 2 to track historical changes to customer addresses.”&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Genie Code understands the CDC patterns and generates the appropriate&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;AUTO CDC&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;declarations.&lt;/P&gt;
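&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;To make the SCD Type 2 behavior concrete, here is a minimal plain-Python sketch of what history tracking does to an ordered stream of change events. It is a conceptual illustration only, not how the pipeline engine implements it: the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;apply_scd2&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;helper and the sample events are hypothetical, deletes are ignored, and the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;__START_AT&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;__END_AT&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;names mirror the validity columns Databricks adds to SCD Type 2 targets.&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;def apply_scd2(events, key="customer_id", seq="operation_timestamp"):
    """Fold upsert-style CDC events into SCD Type 2 row versions."""
    history = []   # every row version, closed or still open
    open_idx = {}  # key value -> index of the open version in history
    for event in sorted(events, key=lambda e: e[seq]):
        k = event[key]
        if k in open_idx:
            # Close the previous version at the new event's sequence value.
            history[open_idx[k]]["__END_AT"] = event[seq]
        # Carry the business columns; drop the CDC bookkeeping fields.
        row = {c: v for c, v in event.items() if c not in (seq, "operation")}
        row["__START_AT"] = event[seq]
        row["__END_AT"] = None  # None marks the current version
        history.append(row)
        open_idx[k] = len(history) - 1
    return history


# Hypothetical events: one insert followed by an address change.
events = [
    {"customer_id": 1, "address": "1 Old Rd", "operation": "INSERT", "operation_timestamp": 1},
    {"customer_id": 1, "address": "2 New St", "operation": "UPDATE", "operation_timestamp": 5},
]
rows = apply_scd2(events)
&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The address change closes the first version at sequence value 5 and opens a second version with no end bound, which is the shape you query when you need a customer's address as of a given point in time.&lt;/P&gt;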
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="v cf nc nd ne nf" role="separator"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="8cfb" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Data Quality Expectations: Your Safety Net&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Expectations are SDP’s built-in data quality framework. There are three violation behaviors:&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;Behavior                 | What Happens                                  | Use When&lt;BR /&gt;ON VIOLATION DROP ROW    | Invalid rows are dropped; the count is logged | Tolerating messy source data&lt;BR /&gt;ON VIOLATION FAIL UPDATE | Entire pipeline update fails                  | Critical fields that must exist&lt;BR /&gt;(no action specified)    | Invalid rows are logged but kept              | Monitoring without blocking&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;FIGURE class="nn no np nq nr ns nk nl paragraph-image"&gt;
&lt;DIV class="nk nl qx"&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="shwetav1407_2-1781633756700.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27851iD9527053953A76C4/image-size/medium?v=v2&amp;amp;px=400" role="button" title="shwetav1407_2-1781633756700.png" alt="shwetav1407_2-1781633756700.png" /&gt;&lt;/span&gt;
&lt;/DIV&gt;
&lt;/FIGURE&gt;
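&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The three behaviors can be sketched in a few lines of plain Python. This is a simplified model for intuition, not the SDP implementation: the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;apply_expectation&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;function, the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;valid_id&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;predicate, and the sample batch are all hypothetical.&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;def apply_expectation(rows, predicate, on_violation=None):
    """Apply one expectation; return (kept_rows, violation_count)."""
    violations = [r for r in rows if not predicate(r)]
    if on_violation == "FAIL UPDATE" and violations:
        # Abort the whole update as soon as any row is invalid.
        raise ValueError(f"{len(violations)} row(s) violated the expectation")
    if on_violation == "DROP ROW":
        kept = [r for r in rows if predicate(r)]
    else:
        kept = list(rows)  # no action: keep every row, only count violations
    return kept, len(violations)


def valid_id(row):
    return row["customer_id"] is not None


# Hypothetical batch with one valid and one invalid row.
batch = [{"customer_id": 1}, {"customer_id": None}]
dropped, dropped_count = apply_expectation(batch, valid_id, "DROP ROW")
warned, warned_count = apply_expectation(batch, valid_id)
&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;With the same input, the drop behavior returns one row, the default behavior returns both rows, and both report one violation. Passing&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;"FAIL UPDATE"&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;raises instead of returning anything, which is why it suits fields the downstream tables cannot live without.&lt;/P&gt;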
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Pro tip:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Use Genie Code to add expectations iteratively. After an initial pipeline run, ask:&lt;/P&gt;
&lt;BLOCKQUOTE class="qt qu qv"&gt;
&lt;P class="md me nb mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;EM class="gn"&gt;“Analyze the bronze_orders data and suggest quality expectations for the silver layer based on the actual data distribution.”&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Genie Code can read the output datasets, profile the data, and propose expectations that make sense for your actual data — not just generic null checks.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="c541" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Production Patterns and Best Practices&lt;/H2&gt;
&lt;H3 id="86c7" class="po oa gn bb ob pp pq pr of ps pt pu oj mo pv pw px ms py pz qa mw qb qc qd qe bg" data-selectable-paragraph=""&gt;1. Pipeline Configuration with YAML Spec&lt;/H3&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Your pipeline project uses a YAML spec file for top-level configuration:&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-comment"&gt;# pipeline.yaml&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-attr"&gt;name:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;ecommerce_analytics&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-attr"&gt;target_catalog:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;analytics&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-attr"&gt;target_schema:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;ecommerce&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-attr"&gt;libraries:&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-bullet"&gt;-&lt;/SPAN&gt; &lt;SPAN class="hljs-attr"&gt;path:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;./bronze_orders.sql&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-bullet"&gt;-&lt;/SPAN&gt; &lt;SPAN class="hljs-attr"&gt;path:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;./silver_orders.sql&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-bullet"&gt;-&lt;/SPAN&gt; &lt;SPAN class="hljs-attr"&gt;path:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;./gold_daily_revenue.sql&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-bullet"&gt;-&lt;/SPAN&gt; &lt;SPAN class="hljs-attr"&gt;path:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;./gold_top_products.sql&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN class="hljs-attr"&gt;configuration:&lt;/SPAN&gt;&lt;BR /&gt;  &lt;SPAN class="hljs-attr"&gt;spark.sql.shuffle.partitions:&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;"auto"&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;H3 id="0336" class="po oa gn bb ob pp pq pr of ps pt pu oj mo pv pw px ms py pz qa mw qb qc qd qe bg" data-selectable-paragraph=""&gt;2. Parameterize with SET&lt;/H3&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Use&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;SET&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to inject environment-specific configurations:&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;&lt;SPAN class="qi oa gn pn b bc qj qk e ql qm" data-selectable-paragraph=""&gt;&lt;SPAN class="hljs-keyword"&gt;SET&lt;/SPAN&gt; env &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'production'&lt;/SPAN&gt;;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;SET&lt;/SPAN&gt; raw_path &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'/Volumes/${env}/raw_data/orders/'&lt;/SPAN&gt;;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;OR&lt;/SPAN&gt; REFRESH STREAMING &lt;SPAN class="hljs-keyword"&gt;TABLE&lt;/SPAN&gt; bronze_orders&lt;BR /&gt;&lt;SPAN class="hljs-keyword"&gt;AS&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class="hljs-operator"&gt;*&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;FROM&lt;/SPAN&gt; STREAM read_files(&lt;BR /&gt;  &lt;SPAN class="hljs-string"&gt;'${raw_path}'&lt;/SPAN&gt;,&lt;BR /&gt;  format &lt;SPAN class="hljs-operator"&gt;=&lt;/SPAN&gt;&lt;SPAN class="hljs-operator"&gt;&amp;gt;&lt;/SPAN&gt; &lt;SPAN class="hljs-string"&gt;'json'&lt;/SPAN&gt;&lt;BR /&gt;);&lt;/SPAN&gt;&lt;/PRE&gt;
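&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;References written as&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;${name}&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;are textual substitution. The plain-Python sketch below mimics that expansion with the standard library's&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;string.Template&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;so you can see what the final path should look like. It is only an analogy: the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;settings&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dictionary is hypothetical, and whether the pipeline runtime expands a reference nested inside another&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;SET&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;value is worth confirming in your own workspace.&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;from string import Template

# Hypothetical stand-in for the values declared with SET.
settings = {"env": "production"}

# Resolve ${env} inside the path template, then store the result as its own setting.
settings["raw_path"] = Template("/Volumes/${env}/raw_data/orders/").substitute(settings)

# A statement that references ${raw_path} then expands to the concrete path.
source_path = Template("${raw_path}").substitute(settings)
&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Switching&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;env&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is then the only change needed to point the same source file at a different volume.&lt;/P&gt;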
&lt;H3 id="5033" class="po oa gn bb ob pp pq pr of ps pt pu oj mo pv pw px ms py pz qa mw qb qc qd qe bg" data-selectable-paragraph=""&gt;3. Mix SQL and Python Files&lt;/H3&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;A single pipeline can contain both SQL and Python source files. Use SQL for straightforward transformations and Python when you need UDFs, ML feature engineering, or complex business logic.&lt;/P&gt;
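&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;As an example of logic that is awkward in SQL and more comfortable in a Python source file, here is a small cleansing function of the kind you might wrap in a UDF. The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE class="db pk pl pm pn b"&gt;normalize_email&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;name and its validation rule are hypothetical, and registering the function inside a pipeline is deliberately left out because that wiring depends on your environment.&lt;/P&gt;
&lt;PRE class="nn no np nq nr qf pn qg bl qh ax bg"&gt;import re

# Deliberately loose pattern: something@something.something, no spaces.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_email(raw):
    """Return a trimmed, lower-cased email, or None if it is not plausibly valid."""
    if raw is None:
        return None
    email = raw.strip().lower()
    return email if _EMAIL_RE.match(email) else None
&lt;/PRE&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Keeping the rule in a plain function means it can be unit tested without a running pipeline, then reused wherever the silver layer needs it.&lt;/P&gt;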
&lt;H3 id="a2f1" class="po oa gn bb ob pp pq pr of ps pt pu oj mo pv pw px ms py pz qa mw qb qc qd qe bg" data-selectable-paragraph=""&gt;4. Use Genie Code for Ongoing Maintenance&lt;/H3&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;Genie Code doesn’t just build pipelines — it monitors them. It can:&lt;/P&gt;
&lt;UL class=""&gt;
&lt;LI id="5269" class="md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Triage failures&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;when a pipeline run breaks&lt;/LI&gt;
&lt;LI id="1554" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Investigate anomalies&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in data quality metrics&lt;/LI&gt;
&lt;LI id="389e" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Handle DBR upgrades&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;by updating deprecated syntax&lt;/LI&gt;
&lt;LI id="648c" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Optimize resource allocation&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;based on observed workload patterns&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;Ask it things like:&lt;/P&gt;
&lt;BLOCKQUOTE class="qt qu qv"&gt;
&lt;P class="md me nb mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;EM class="gn"&gt;“The silver_orders pipeline has been failing since yesterday. Diagnose the issue.”&lt;/EM&gt;&lt;/P&gt;
&lt;P class="md me nb mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;EM class="gn"&gt;“Optimize the compute configuration for this pipeline — it’s running slowly on large backfills.”&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;H2 id="9665" class="nz oa gn bb ob oc od oe of og oh oi oj ok ol om on oo op oq or os ot ou ov ow bg" data-selectable-paragraph=""&gt;Wrapping Up&lt;/H2&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg ox mi mj mk oy mm mn mo oz mq mr ms pa mu mv mw pb my mz na gg bg" data-selectable-paragraph=""&gt;The combination of SDP and Genie Code represents a genuine paradigm shift for data engineering on Databricks. SDP eliminates the boilerplate of pipeline orchestration, and Genie Code eliminates the boilerplate of writing SDP. What used to take days of manual pipeline construction can now happen in a single conversation.&lt;/P&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;The key takeaways:&lt;/P&gt;
&lt;UL class=""&gt;
&lt;LI id="bf76" class="md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Start with SDP&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— even without Genie Code, the declarative approach saves enormous amounts of manual orchestration code.&lt;/LI&gt;
&lt;LI id="aacf" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Use Genie Code Agent mode&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in the Lakeflow Pipelines Editor to plan, generate, and validate entire pipelines from natural language.&lt;/LI&gt;
&lt;LI id="7f69" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Build iteratively&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— start with a basic bronze-silver-gold structure, then ask Genie Code to add CDC handling, expectations, and enrichments.&lt;/LI&gt;
&lt;LI id="d742" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Trust the loop&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— Genie Code’s ability to run the pipeline, read outputs, diagnose errors, and fix them autonomously is the real superpower.&lt;/LI&gt;
&lt;LI id="d545" class="md me gn mf b mg pf mi mj mk pg mm mn mo ph mq mr ms pi mu mv mw pj my mz na pc pd pe bg" data-selectable-paragraph=""&gt;&lt;STRONG class="mf go"&gt;Keep humans in control&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— every destructive action requires your approval. Genie Code proposes; you decide.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;SDP and Genie Code are both generally available today at no additional cost for all Databricks customers. Open the Lakeflow Pipelines Editor, flip on Agent mode, and start talking to your data infrastructure.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="gg gh gi gj gk"&gt;
&lt;DIV class="v cf"&gt;
&lt;DIV class="cm bd fs ft fu fv"&gt;
&lt;P class="pw-post-body-paragraph md me gn mf b mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na gg bg" data-selectable-paragraph=""&gt;&lt;EM class="nb"&gt;Ready to get started? Check out the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/EM&gt;&lt;A class="z qy" href="https://docs.databricks.com/aws/en/ldp/" rel="noopener ugc nofollow" target="_blank"&gt;&lt;EM class="nb"&gt;Databricks SDP documentation&lt;/EM&gt;&lt;/A&gt;&lt;EM class="nb"&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/EM&gt;&lt;A class="z qy" href="https://docs.databricks.com/aws/en/ldp/de-agent" rel="noopener ugc nofollow" target="_blank"&gt;&lt;EM class="nb"&gt;Genie Code guide for pipeline development&lt;/EM&gt;&lt;/A&gt;&lt;EM class="nb"&gt;.&lt;/EM&gt;&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Tue, 16 Jun 2026 18:22:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/building-production-ready-sdp-pipelines-with-genie-code-the/m-p/159195#M1281</guid>
      <dc:creator>shwetav1407</dc:creator>
      <dc:date>2026-06-16T18:22:36Z</dc:date>
    </item>
    <item>
      <title>The Comparison: Why the Alternatives Fall Short for Databricks-Native Agentic Systems</title>
      <link>https://community.databricks.com/t5/community-articles/the-comparison-why-the-alternatives-fall-short-for-databricks/m-p/159186#M1280</link>
      <description>&lt;P class=""&gt;&lt;EM&gt;The OLTP architecture your agentic systems actually need, and how it compares to Supabase, Azure PostgreSQL, and Cosmos DB&lt;/EM&gt;&lt;/P&gt;&lt;HR /&gt;&lt;P class=""&gt;Earlier this year, Nikita Shamgunov — the engineer leading Databricks Lakebase — published a number that reframed my entire architecture review: AI agents now create roughly 4x more databases than human developers.&lt;/P&gt;&lt;P class=""&gt;Not 4x more queries. 4x more &lt;EM&gt;databases&lt;/EM&gt;.&lt;/P&gt;&lt;P class=""&gt;If you're building agentic AI systems on Databricks and still reaching for Supabase, Azure Database for PostgreSQL, or Cosmos DB as your OLTP layer — this article will challenge that decision. Not because those platforms are bad. They're not. But because they were designed for a world where humans write schemas, humans provision databases, and humans decide when something scales. Agents don't work that way. And the architecture that serves human-paced development quietly breaks under agentic workloads.&lt;/P&gt;&lt;P class=""&gt;I learned this the hard way while building an internal Agentic Intelligence Platform at Celebal Technologies — three agent modules (Swarm Coordination, Ontology-Based Reasoning, and Causal Optimization) sharing a unified LLMOps spine on Databricks. I'll show you exactly what I got wrong in the database layer, what Lakebase changes, and how the alternatives stack up for teams building enterprise AI on Databricks.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember733"&gt;The Problem: Agents Don't Use Databases the Way Humans Do&lt;/H3&gt;&lt;P class=""&gt;Traditional database architecture assumes a human-paced world. Applications write transactions. Dashboards read. ETL pipelines shuttle data between the OLTP and OLAP layers. 
The entire stack was designed around predictable access patterns and a well-understood divide between operational and analytical data.&lt;/P&gt;&lt;P class=""&gt;Agents shatter all three of those assumptions simultaneously.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;They're inherently ephemeral.&lt;/STRONG&gt; A swarm agent coordinating a supply chain analysis spins up, decomposes a task across five specialist agents, writes hundreds of state checkpoints, and terminates — all in under thirty seconds. The next invocation may run on a completely different thread with zero shared context from the prior session. Legacy databases aren't built for disposable, bursty compute that needs to scale to zero between workloads and spin back up instantly for the next one.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;They generate massive, high-frequency state churn.&lt;/STRONG&gt; Every tool call, reasoning step, context retrieval, and handoff between agents is a potential checkpoint. For a multi-turn swarm agent handling a complex analytical task, that's hundreds of writes per session — each requiring exact-ID retrieval by thread_id or session_id, not vector similarity search. Postgres handles this natively. A Delta table, even a well-ZORDER'd one, adds overhead for an access pattern it was never designed to serve.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;They need to reach analytical data without crossing a platform boundary.&lt;/STRONG&gt; An agent recommending inventory adjustments needs to query the Gold Delta tables — the same tables your ML models trained on, governed by the same Unity Catalog policies your data engineering team enforces. 
If your OLTP layer lives outside Databricks, you're building a data copy pipeline just so your agent can read data that's already on the platform.&lt;/P&gt;&lt;P class=""&gt;That third problem is where I went wrong.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember740"&gt;The Mistake I Made Building the Agentic Platform&lt;/H3&gt;&lt;P class=""&gt;When I built the Swarm Coordination module of our Agentic Intelligence Platform, I used a Unity Catalog Delta table as the shared persistent memory store for multi-turn agent sessions. Delta was a reasonable first choice — it gave me time travel for session debugging, UC lineage on every agent write, and the ability to query session history in SparkSQL.&lt;/P&gt;&lt;P class=""&gt;But Delta is an OLAP-optimized storage format. When the coordinator agent needed to retrieve the exact current state for a specific thread_id, it was running a scan-optimized query engine against a point-lookup workload. I added ZORDER on (session_id, turn_number) and tuned file sizes — which helped. But it was always the wrong tool for the access pattern.&lt;/P&gt;&lt;P class=""&gt;What the architecture actually needed was a clean separation of concerns:&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Short-term session state&lt;/STRONG&gt; (checkpoints, thread context, current turn, handoff records) → a transactional store with exact-ID retrieval and sub-10ms read latency&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Long-term episodic memory&lt;/STRONG&gt; (past session summaries, cross-session reasoning patterns, performance analytics) → Delta Lake, where batch SparkSQL queries and Lakehouse Monitoring make sense&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;Lakebase is the transactional half of that equation. 
And it's the piece I didn't have.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember746"&gt;What Lakebase Provides for Agentic Systems&lt;/H3&gt;&lt;P class=""&gt;Lakebase is Databricks' fully managed, serverless PostgreSQL database — built on the Neon architecture (which Databricks acquired) and integrated natively into the Databricks platform. It reached General Availability in February 2026. Here are the capabilities that directly change the agent architecture:&lt;/P&gt;&lt;H3 id="ember748"&gt;Native LangGraph Checkpointing&lt;/H3&gt;&lt;P class=""&gt;Lakebase is a supported LangGraph checkpointer backend on both Databricks Apps and Model Serving endpoints. Authentication between your application and Lakebase is resolved automatically through the platform's Service Principal — no credential management in application code, no secret rotation for a separate database connection string.&lt;/P&gt;&lt;PRE&gt;from langgraph.checkpoint.postgres import PostgresSaver
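from langgraph.prebuilt import create_react_agent
from psycopg import Connection
from psycopg.rows import dict_row
from databricks.sdk import WorkspaceClient

# Databricks resolves authentication automatically via Service Principal.
# NOTE: get_connection_string is reproduced as written here; confirm the exact
# call for Lakebase connection details against the current databricks-sdk reference.
w = WorkspaceClient()
conn_str = w.lakebase.get_connection_string(instance_name="agent-state-prod")

# PostgresSaver.from_conn_string() is a context manager, so open a psycopg
# connection explicitly and pass it to the LangGraph Postgres checkpointer.
conn = Connection.connect(conn_str, autocommit=True, prepare_threshold=0, row_factory=dict_row)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # creates the checkpoint tables on first use

# The agent now has durable, OLTP-grade session state.
# `model` and `tools` are assumed to be defined elsewhere in the application.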
agent = create_react_agent(model, tools, checkpointer=checkpointer)&lt;/PRE&gt;&lt;P class=""&gt;This is the pattern you'd apply to the Swarm Coordination module. The coordinator's session state — which agent it's routing to, which specialist has already responded, the current confidence score — lives in Lakebase. The MLflow Trace of the full execution graph is separate (logged as a Databricks artifact). Two different concerns, two different stores, each doing what it does best.&lt;/P&gt;&lt;H3 id="ember751"&gt;Instant Database Branching for Agent Experimentation&lt;/H3&gt;&lt;P class=""&gt;This is the capability that directly addresses the "4x more databases" pattern. Lakebase supports copy-on-write branching: a full, isolated branch of a production-scale database in under one second, at near-zero initial storage cost (only diffs are written on change).&lt;/P&gt;&lt;P class=""&gt;For agents, this changes what's possible:&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A Causal Optimization agent running counterfactual "what-if" scenarios can branch the intervention state, explore the outcome, and discard the branch — without any risk to the production state&lt;/LI&gt;&lt;LI&gt;An agent autonomously testing schema migrations can branch, run the migration, validate, and either promote or roll back in a single API call&lt;/LI&gt;&lt;LI&gt;Development environments for agent workflows are ephemeral by default, provisioned and torn down programmatically&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;Databricks telemetry shows production Lakebase deployments averaging roughly 10 branches per database project, with some agent-driven workflows reaching hundreds of nested iterations. 
That pattern is structurally impossible with traditional managed Postgres where creating a copy requires duplicating the full storage filesystem.&lt;/P&gt;&lt;H3 id="ember756"&gt;Autoscaling with Scale-to-Zero&lt;/H3&gt;&lt;P class=""&gt;Agent workloads are bursty in a way that application workloads rarely are. Thousands of concurrent sessions during business hours, complete silence at 2am. Lakebase scales its compute up under load and down to zero between workloads — costs align with actual usage, not provisioned capacity. For multi-agent platforms running on Databricks Apps, this means the transactional backend matches the compute model of the application layer itself.&lt;/P&gt;&lt;H3 id="ember758"&gt;Managed Delta Sync — The ETL Eliminator&lt;/H3&gt;&lt;P class=""&gt;Every write to Lakebase is automatically synced to Delta tables in Unity Catalog. For agent systems, this is what closes the long-term memory loop without custom code:&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Agent session checkpoints (short-term) → Lakebase → automatic Delta sync → Gold layer for analysis&lt;/LI&gt;&lt;LI&gt;Lakehouse Monitoring can track agent reasoning drift, latency patterns, and success rate from the Delta-synced inference data&lt;/LI&gt;&lt;LI&gt;The grid operations team in our Solar Forecasting project needed low-latency reads on Gold forecast data — we built a data copy pipeline as a workaround that added latency and a maintenance surface.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3 id="ember761"&gt;Unity Catalog as the Single Governance Layer&lt;/H3&gt;&lt;P class=""&gt;Lakebase instances are registered in Unity Catalog under the same 3-level namespace as your Delta tables and ML models. The same row-level security policies, column masking, lineage graphs, and access audit logs that govern energy_nz.solar.gold also govern the Lakebase instance storing agent session state. 
For enterprise AI systems operating under regulatory oversight, this is a structural requirement — not a preference.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember763"&gt;The Comparison: Why the Alternatives Fall Short for Databricks-Native Agentic Systems&lt;/H3&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Blitzkrieg_2-1781631361677.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/27850i5E576E8C3A09B678/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Blitzkrieg_2-1781631361677.png" alt="Blitzkrieg_2-1781631361677.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;P class=""&gt;&lt;STRONG&gt;Supabase&lt;/STRONG&gt; is an excellent platform for its target use case. Postgres, auth, storage, real-time subscriptions, and edge functions bundled into a working backend in minutes — at $25/month, it's exceptionally competitive for early-stage web applications. But for enterprise agentic systems on Databricks, there are two structural gaps that don't close with configuration: there is no Unity Catalog (agents operating on governed enterprise data need the same governance layer as the data itself), and there is no Lakehouse sync (analytical data still requires an ETL pipeline to reach Supabase, and Supabase data requires an ETL pipeline to reach the Lakehouse for monitoring and ML). Supabase asks you to build and maintain that bridge. Lakebase eliminates it.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Azure Database for PostgreSQL Flexible Server&lt;/STRONG&gt; is a solid choice for traditional Azure-native transactional workloads. But compute and storage are coupled together — creating an isolated development copy of a production database requires duplicating the full storage volume, an operation measured in hours and charged by the gigabyte. 
There is no native database branching, no Lakehouse sync, and the governance model (Azure RBAC) is entirely separate from Unity Catalog. For teams building on Azure Databricks who want a single governance boundary across OLTP, OLAP, and ML — this means managing two different access control systems with no native bridge between them.&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Azure Cosmos DB&lt;/STRONG&gt; is purpose-built for globally distributed, multi-region, flexible-schema NoSQL workloads — a genuinely different problem from agentic state management. It's not PostgreSQL-compatible, which means LangGraph's Postgres checkpointer doesn't apply, standard psycopg2 drivers don't connect, and the document model doesn't naturally represent the relational shape of session checkpoints and handoff records. Cosmos DB is the right answer for a different question.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember769"&gt;What I'd Rebuild in the Agentic Platform&lt;/H3&gt;&lt;P class=""&gt;With Lakebase available, the architecture for the three modules changes specifically:&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Module 1 — Swarm Coordination:&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Coordinator checkpoint store → Lakebase&lt;/STRONG&gt;: thread state, current turn context, handoff records, confidence scores per routing decision. LangGraph Postgres checkpointer on Databricks Apps, authentication via Service Principal.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Agent episodic memory → Delta Lake (unchanged)&lt;/STRONG&gt;: cross-session analytical queries, SHAP analysis across sessions, Lakehouse Monitoring on reasoning patterns. 
Lakebase managed sync keeps Delta current automatically.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Module 2 — Ontology-Based Reasoning:&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Ontology triples → Delta (unchanged)&lt;/STRONG&gt;: batch reads by the re-ranking gate, SQL queries for sub-graph retrieval. No change needed here — this is an OLAP access pattern.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Grounding cache → Lakebase&lt;/STRONG&gt;: frequently accessed ontology sub-graphs cached in Postgres for sub-50ms retrieval during the agent's inner reasoning loop.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;Module 3 — Causal Optimization:&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Intervention results → Lakebase → managed Delta sync&lt;/STRONG&gt;: causal engine writes intervention outcomes (high-frequency, transactional) to Lakebase. Sync pushes results to the Gold Delta layer for downstream analytics without custom ETL.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Causal DAG structure → Delta (unchanged)&lt;/STRONG&gt;: the DAG (edges, confidence scores, version history) is read by batch retraining jobs after PSI-triggered re-learning. Delta time travel for DAG versioning is already the right pattern here.&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;The net effect: short-term transactional operations at Postgres latency, long-term analytical operations at Delta scale, a single Unity Catalog governance layer across both, and zero custom ETL pipelines connecting them.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember778"&gt;When Lakebase Isn't the Answer&lt;/H3&gt;&lt;P class=""&gt;A credible recommendation has boundaries. 
Lakebase is not the right choice when:&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Your OLTP workload is genuinely independent of analytics — a standalone web app with no ML components or Lakehouse integration doesn't benefit from the co-location.&lt;/LI&gt;&lt;LI&gt;You need niche Postgres extensions not yet supported in Lakebase's managed environment (specialized GIS, custom time-series extensions).&lt;/LI&gt;&lt;LI&gt;You're building a consumer-facing mobile application where Supabase's bundled auth, storage, and real-time subscriptions are the actual product value.&lt;/LI&gt;&lt;LI&gt;You're not on Databricks. The Lakehouse integration is the primary differentiation — without it, Lakebase is a well-engineered managed Postgres, but not a category-defining choice.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;The decision criterion is simple: how close is your agent workload to your Databricks analytics and ML stack? The closer it is, the more Lakebase earns its place.&lt;/P&gt;&lt;HR /&gt;&lt;H3 id="ember782"&gt;The Larger Picture&lt;/H3&gt;&lt;P class=""&gt;Databricks started as the platform where you process and model data. Unity Catalog is the platform where you govern data. Lakebase makes it the platform where you run transactional applications &lt;EM&gt;on&lt;/EM&gt; that data — without copying it, without bridging governance models, without maintaining a second operational stack alongside your analytics stack.&lt;/P&gt;&lt;P class=""&gt;The 4x database creation stat isn't a curiosity. It's a forcing function. When agents provision databases at that rate, every architectural inefficiency — the manual provisioning, the ETL pipeline, the separate governance model — compounds at agent speed. 
Human architects designed those inefficiencies in; agents will expose them.&lt;/P&gt;&lt;P class=""&gt;When I mentally rebuild the Agentic Platform architecture with Lakebase in place, the change is not additive — it's structural. It's the difference between three systems (OLTP, OLAP, ML) connected by pipelines you maintain, and one platform where those boundaries exist only in your mental model.&lt;/P&gt;&lt;HR /&gt;&lt;P class=""&gt;&lt;EM&gt;If this resonated, I'd welcome your thoughts in the comments — especially if you've hit the OLTP/OLAP boundary problem in your own agentic architectures. What did your workaround look like?&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jun 2026 17:48:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/the-comparison-why-the-alternatives-fall-short-for-databricks/m-p/159186#M1280</guid>
      <dc:creator>Agre_Celebal</dc:creator>
      <dc:date>2026-06-16T17:48:47Z</dc:date>
    </item>
    <item>
      <title>Gold Layer Design on Databricks — MERGE vs Overwrite, Partitioning, SCD Type 2 from SAP</title>
      <link>https://community.databricks.com/t5/community-articles/gold-layer-design-on-databricks-merge-vs-overwrite-partitioning/m-p/159034#M1279</link>
      <description>&lt;P&gt;Part 3 of my series on building an enterprise data platform on Databricks is up - this one covers Gold layer design.&lt;/P&gt;&lt;P&gt;The short version: Gold isn't just aggregated Silver. Silver maps to your source system. Gold maps to the business questions your consumers are actually asking - and those two things are almost never the same shape.&lt;/P&gt;&lt;P&gt;What's in the post:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;MERGE vs overwrite for Gold writes, and the threshold where we switched (40-minute overwrite runs on vendor_balance at ~10M rows)&lt;/LI&gt;&lt;LI&gt;Partitioning strategy for financial tables: BUKRS+GJAHR for period aggregates, BUKRS alone for balances, no partition on dimensions&lt;/LI&gt;&lt;LI&gt;Z-ordering on LIFNR+MONAT for finance report query patterns&lt;/LI&gt;&lt;LI&gt;SCD Type 2 from SAP master data using a validity window at Gold&lt;/LI&gt;&lt;LI&gt;What doesn't belong in Gold — and the two days we spent auditing a table we eventually deleted&lt;/LI&gt;&lt;LI&gt;Full vendor_balance Gold table in PySpark with MERGE pattern&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;This is Part 3 of 5. Parts 1 and 2 covered Bronze ingestion (GoldenGate + Kafka + Structured Streaming alongside JDBC historical load) and Silver reconciliation. Part 4 is about why three-layer medallion wasn't enough and what we added.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Full post:&lt;/STRONG&gt; &lt;A title="Designing the Gold Layer on Databricks — What Belongs and What Doesn’t" href="https://medium.com/@savlahanish/gold-is-not-just-aggregated-silver-designing-for-business-questions-035d14102459" target="_self"&gt;Designing the Gold Layer on Databricks — What Belongs and What Doesn’t&lt;/A&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Happy to answer questions on any of the decisions — there were a few where we went back and forth longer than we should have.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Jun 2026 11:29:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/gold-layer-design-on-databricks-merge-vs-overwrite-partitioning/m-p/159034#M1279</guid>
      <dc:creator>savlahanish27</dc:creator>
      <dc:date>2026-06-15T11:29:39Z</dc:date>
    </item>
  </channel>
</rss>

