Community Articles topics

LTAP: What Databricks New Transactional-Analytical Architecture Means for Data Engineers

AmitDECopilot — Sat, 27 Jun 2026 13:06:30 GMT

For years, enterprise data architecture has followed a familiar pattern.

An application writes customer orders, account updates, inventory changes, or transactions into an operational database.

Then data engineering takes over.

We capture changes through CDC. We land them in a lake or warehouse. We transform them through multiple layers. We create curated tables for analytics. Then, in many cases, we move enriched data back into an application through APIs, reverse ETL, or another synchronization process.

A simplified version looks like this:

Operational Application Database
        ↓
CDC / Replication
        ↓
Landing Layer
        ↓
Transformation Pipelines
        ↓
Lakehouse / Warehouse
        ↓
Dashboards, ML Models, AI Agents
        ↓
Reverse ETL / API / Application Sync

This architecture is common for a reason. It works.

But it also creates several familiar problems:

Multiple copies of the same business entity
Delays between application activity and analytical availability
CDC failures and schema drift
Reconciliation effort between operational and analytical views
Complex reverse ETL or API layers
Different governance models across different systems
AI applications operating on stale or incomplete context

Databricks’ new LTAP architecture is interesting because it challenges the assumption that transactional and analytical data must always live in separate worlds.

What Is LTAP?

LTAP stands for Lake Transactional/Analytical Processing.

The idea is not simply to run OLTP and OLAP workloads inside one engine.

Instead, LTAP aims to bring transactional, analytical, streaming, and AI application workloads closer to a shared governed data foundation.

Databricks positions Lakebase as the transactional layer in this model: a managed Postgres-compatible database integrated with the broader Databricks platform. The architectural goal is to reduce the need for separate copies, replicated pipelines, and synchronization layers between applications and analytics.

In simple terms, LTAP asks:

What if an operational application, an analytics team, and an AI agent could work from a much closer version of the same governed data foundation?

That is a meaningful question for data engineers.

The Traditional Gap Between OLTP and OLAP

Let us take a simple customer-order scenario.

A customer places an order through an e-commerce application.

The application writes the order into an operational database.

The data engineering team then captures the change, transforms it, enriches it with customer and product data, and publishes it to analytics tables.

Later, an AI assistant may use that data to answer questions such as:

Why did this customer’s order fail?
Is this customer eligible for a retention offer?
What products are frequently purchased together?
Is there a fraud or fulfillment risk?
Should the application trigger a proactive action?

In a traditional architecture, each of those steps may involve separate systems and delayed synchronization.

Application Database
        ↓
CDC Pipeline
        ↓
Bronze / Raw Layer
        ↓
Silver / Cleansed Layer
        ↓
Gold / Analytics Layer
        ↓
Feature Store / API / Reverse ETL
        ↓
Application or AI Agent

The problem is not that any individual layer is bad.

The problem is that every handoff creates additional operational responsibility.

Someone must monitor the pipeline.

Someone must handle a failed CDC batch.

Someone must reconcile the dashboard number with the application number.

Someone must decide what happens when the source schema changes.

Someone must explain why the AI assistant used yesterday’s data while the application showed a newer transaction.

LTAP does not make these concerns disappear completely. But it creates a new architectural option for reducing unnecessary distance between the application, the data platform, and the intelligence layer.

What Changes With LTAP?

The most important shift is not “Databricks now supports transactions.”

The more important shift is:

The transactional and analytical worlds can be designed around a more unified storage and governance foundation.

That could reduce several common integration patterns:

Before:
Operational DB → CDC → Lakehouse → Reverse ETL → Application

Potential LTAP Pattern:
Application + Operational Data + Analytics + AI Context
        ↓
Shared Governed Data Foundation

For the right use cases, this can reduce:

Data replication
Pipeline latency
Reconciliation complexity
Reverse ETL maintenance
Fragmented security models
Duplicate lineage documentation
Delayed context for AI-powered applications

However, this does not mean every existing operational database should be moved immediately.

LTAP is a design option, not a universal replacement strategy.

A Practical Example: Customer Support and Fraud Review

Consider a customer-support or fraud-investigation workflow.

A support agent needs to see the latest customer profile, recent transactions, risk indicators, product history, and open service cases.

A fraud analyst needs historical behavior, anomaly scores, device patterns, and transaction trends.

An AI assistant needs governed context before it recommends an action.

In a traditional architecture, these could be spread across:

An application database
A CRM system
A data warehouse
A feature store
A vector database
A reverse ETL tool
Several APIs

That means the agent or application may operate on a mixture of current and delayed information.

An LTAP-style architecture could allow teams to design this more directly:

Customer Transaction
        ↓
Transactional Operational State
        ↓
Shared Governed Data Foundation
        ↓
Analytics / Risk Models / AI Agent Context
        ↓
Human or Application Action

The value is not simply speed.

The value is that operational action, analytical understanding, and AI recommendation can be designed around more consistent data context.

Where Data Engineers Still Matter

There is a temptation with new platform architectures to assume that fewer pipelines means less data engineering.

I see it differently.

LTAP may reduce unnecessary plumbing, but it makes data engineering decisions even more important.

Teams will still need to design:

1. Workload Boundaries

Not every workload needs real-time access.

Some data should remain asynchronous because of cost, scale, reliability, or business-process requirements.

A daily finance reconciliation process does not necessarily need the same architecture as a real-time fraud decision.

2. Data Contracts

If operational and analytical workloads are closer together, schema discipline becomes more important.

A small application-side schema change can have downstream impact on:

Analytics
Machine learning features
AI agent context
Data quality rules
Regulatory reports
Customer-facing workflows

Data contracts, schema evolution rules, and impact analysis remain essential.

3. Governance and Access Controls

A single governed foundation is valuable only when access controls are designed properly.

Teams still need to define:

Which users can read transactional data
Which users can update it
Which data can be exposed to AI agents
How sensitive fields are masked
How access is audited
How long data is retained
How recovery and rollback work

This is where unified governance can become more valuable than simply reducing pipeline count.

4. Data Quality and Reconciliation

LTAP may reduce copies, but it does not remove data-quality issues.

Bad source data is still bad data.

Missing customer identifiers, incorrect product mappings, unexpected nulls, duplicate transactions, and invalid business rules still need validation.

The difference is that data quality checks can potentially be designed closer to the point where operational and analytical decisions meet.

5. Human Approval for AI Actions

As AI agents move from answering questions to recommending or triggering actions, governance becomes critical.

An agent that sees fresh customer data is useful.

An agent that can trigger a customer action, change a workflow, or make a financial recommendation without review is a governance risk.

The future architecture needs more than real-time data.

It needs:

Trusted Data
→ Validated Context
→ AI Recommendation
→ Human Review or Policy Check
→ Approved Action

That is where data engineering, governance, and AI engineering come together.

LTAP Does Not Eliminate ETL

It is important to be realistic.

There will still be ETL, ELT, streaming transformations, data modeling, quality checks, and integration work.

Organizations will continue to have:

SaaS applications
Mainframes
Third-party platforms
Vendor APIs
Legacy operational systems
Regulatory reporting requirements
Historical archives
Domain-specific data products

LTAP will not magically eliminate those realities.

But it may reduce a category of pipelines that exist only because operational and analytical environments are disconnected by default.

That is a meaningful architectural shift.

Questions I Would Ask Before Adopting LTAP

Before adopting LTAP for an enterprise use case, I would ask:

Which current pipelines exist only to synchronize operational and analytical copies?
Which workflows truly need low-latency operational plus analytical context?
Which workloads must remain isolated for performance, reliability, or compliance?
What data contracts are required before operational and analytical consumers share the same foundation?
How will schema changes be governed?
How will AI agents access transactional context safely?
What approval and audit controls are needed for agent-driven actions?
How will teams measure whether LTAP reduces cost, latency, incidents, or reconciliation effort?

These questions keep the discussion practical.

Final Thought

The most interesting part of LTAP is not that it promises fewer pipelines.

It is that it gives enterprises a new way to think about the relationship between:

Operational applications
Transactional data
Streaming data
Analytics
AI agents
Governance

For a long time, we accepted that those systems had to be connected through layers of copying, synchronization, and operational glue.

LTAP suggests that for some use cases, they can be designed around a closer and more governed foundation.

For data engineers, that does not reduce the importance of architecture.

It raises the importance of getting the architecture right.

The future will not be “one platform for everything.”

The future will be choosing the right boundary between real-time operational needs, analytical scale, governance, and human accountability.

Reading Spark UI: A Repeatable Guide to Finding Performance Bottlenecks

Ashwin_DSA — Thu, 25 Jun 2026 19:11:51 GMT

A question came up in the community recently that I thought deserved more than a short answer. The question was around how to build a reliable investigation sequence for slow Spark jobs, specifically when symptoms overlap. A long-running stage with high spill and a few slow tasks could be data skew, insufficient executor memory, too few partitions, or an inefficient join strategy. The Spark UI has all the information you need to tell them apart, but only if you know where to look and in what order.

I put together a lab notebook with three intentionally broken jobs, one per bottleneck type, ran them on a Databricks cluster with Photon disabled to expose the classic Spark signatures, and captured every screenshot from a real run. This post walks through what each bottleneck looks like in the UI, the fix, and how to confirm the fix actually worked. The goal is a sequence you can apply directly the next time a stage takes longer than it should.

Key Takeaways

Start in the Stages tab. Find the slowest stage, then ask three questions in order before touching any configuration.
Data skew, memory pressure, and underparallelism each produce a distinct Spark UI signature. The Max/Median duration ratio and the spill distribution are the fastest discriminators.
A single faster run is not validation. The underlying metric (GC time, spill, task ratio) must move, not just wall-clock time.
Enable AQE before doing any manual tuning. It resolves a large fraction of shuffle partition and broadcast join problems automatically.

What's in this post

The investigation sequence
Scenario 1: Data skew
Scenario 1b: Broadcast join fix
Scenario 2: Memory pressure
Scenario 2b: Partition fix
Scenario 3: Underparallelism
Decision map and thresholds
Validating the fix

The investigation sequence

Before clicking anything, open the Stages tab and sort by Duration descending. Pick the longest stage. Everything else is noise until you understand that one stage.

Inside the stage, you need three numbers before drawing any conclusions:

Median task duration
Max task duration
Median vs Max shuffle read size per task

The ratio of Max to Median is your first discriminator. From there, three questions applied in order will identify the primary bottleneck in the large majority of cases.

Question 1: Is Max Duration more than 5x Median? If yes, check whether shuffle read bytes are also skewed. Both conditions together indicate data skew.

Question 2: Does spill appear on most tasks (not just outliers)? Is GC time above 10% in the Executors tab? If yes, it is memory pressure.

Question 3: Is task count well below 2x your executor core count? If yes, it is underparallelism.

If none of those fit, open the SQL / DataFrame tab and look at the physical plan. Missing predicate pushdown, unexpected cross joins, or a sort-merge join where broadcast would work are the next places to look.

Scenario 1: Data skew

The lab job joins a 20 million-row fact table where 70% of rows share the same join key (key=1) against a 20-row reference table. Broadcast is explicitly disabled to force a sort-merge join, which means all 20 million rows must be shuffled and sorted by join key before the merge. The partition holding key=1 receives 14 million rows. Every other partition receives a handful.

from pyspark.sql import functions as F

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # force sort-merge join

skew_data = (
    spark.range(20_000_000)
    .withColumn("join_key",
        F.when(F.rand() < 0.70, F.lit(1))
         .when(F.rand() < 0.85, F.lit(2))
         .otherwise((F.rand() * 18 + 3).cast("int")))
    .withColumn("value", F.rand() * 1000)
    .withColumn("payload", F.expr("repeat(cast(rand() as string), 50)"))
)

ref_data = (
    spark.range(1, 21)
    .withColumnRenamed("id", "join_key")
    .withColumn("label", F.concat(F.lit("cat_"), F.col("join_key").cast("string")))
)

result = (
    skew_data.join(ref_data, "join_key", "inner")
    .groupBy("join_key", "label")
    .agg(F.sum("value").alias("total"), F.count("*").alias("cnt"))
)
result.write.format("noop").mode("overwrite").save()

Task Metrics: the 24x ratio

Stage 2 Task Metrics. Median duration 38ms, Max 0.9s: a 24x ratio. Shuffle Read is 0 bytes on 199 of 200 partitions; one partition holds the hot key.

The Task Metrics summary table is the most important screen in the Spark UI for diagnosing skew. Two rows matter here:

Duration: Median 38ms, Max 0.9s. A 24x ratio. The 75th percentile sits at 53ms, meaning the outlier is not just slightly above average. It is categorically different from the rest of the distribution.
Shuffle Read Size: Median 0 bytes / 0 records across 199 of 200 tasks. The Max partition receives all the data. Most tasks have nothing to process.

This is the defining skew signature: a bimodal distribution where the vast majority of tasks finish in milliseconds and one task runs for orders of magnitude longer.

Event Timeline: the visual tell

Event Timeline for Stage 2. A handful of long green (Executor Computing Time) bars at the top, followed by 190+ short bars. This bimodal shape is the skew signature.

The Event Timeline converts the numeric ratio into something immediately visual. Long green bars indicate CPU time spent processing the hot key partition. The colour here matters: green is Executor Computing Time, confirming the slow tasks are CPU-bound on data processing, not waiting on I/O or network. Memory pressure and underparallelism do not produce this bimodal shape, which makes it the fastest visual discriminator between the three bottleneck types.

DAG Visualization: confirming the join strategy

DAG for Stage 2. Two Exchange (shuffle) nodes feed into two Sort operations, which merge via SortMergeJoin inside WholeStageCodegen. Both sides were fully shuffled and sorted by join key, making skew on the join key directly visible as a task outlier.

The DAG confirms the mechanism. Two Exchange nodes (one per join side) followed by Sort operations and a SortMergeJoin. This is the join strategy that makes skew dangerous: every row for key=1 lands on the same partition, and one task must process all of it alone.

Scenario 1b: Broadcast join fix

The reference table has 20 rows. It fits comfortably in executor memory. Enabling broadcast replicates it to every executor, eliminating the need to shuffle the join key entirely. The SortMergeJoin and its two Exchange stages disappear from the plan.

# Re-enable broadcast: small table replicated to every executor, no shuffle on join key
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

result = (
    skew_data.join(F.broadcast(ref_data), "join_key", "inner")
    .groupBy("join_key", "label")
    .agg(F.sum("value").alias("total"), F.count("*").alias("cnt"))
)
result.write.format("noop").mode("overwrite").save()

Jobs comparison: the numbers

Jobs tab. Scenario 1 (no broadcast): 18s, 4 stages, 408 tasks. Scenario 1b (broadcast): 5s, 2 stages, 204 tasks. Half the stages, half the tasks, 3.6x faster.

The improvement is not incremental. The fix collapses two of the four stages entirely. In production, where a skewed sort-merge join might anchor a 30-minute stage, this is the difference between a job completing before business hours and one that misses its SLA.

Stages after the fix

Scenario 1b Stages. Two stages: a 4-task scan and a 200-task aggregation shuffle. The 200-task shuffle join stage that produced the 24x task ratio is absent.

The fix did not make the slow task faster. It eliminated the stage that contained the slow task. That is a fundamentally different kind of improvement and it is the one to aim for with skew. If the data allows it, removing the shuffle is better than optimizing within it.

When broadcast is not an option: if the smaller side of the join is too large to broadcast (typically above 500MB to 1GB depending on executor memory), the next tool is salting. Add a random suffix to the hot key before the join to distribute its rows across multiple partitions, then strip the suffix in a second pass. AQE's skew join optimization (spark.sql.adaptive.skewJoin.enabled) automates a version of this when the skewed partition exceeds the configured threshold.

Scenario 2: Memory pressure

The lab job processes 4 million wide rows, each approximately 400 bytes of string data across five columns, shuffled into only 20 partitions. Each task receives roughly 8 MB of data. A collect_list aggregation forces each task to hold the full list of strings in heap before writing the result. The combination of large per-task data and an in-memory accumulator produces GC pressure across all tasks.

spark.conf.set("spark.sql.shuffle.partitions", "20")  # intentionally too few

wide_data = (
    spark.range(4_000_000)
    .withColumn("group_key", (F.col("id") % 20).cast("int"))
    .withColumn("col_a", F.expr("repeat(cast(rand() as string), 80)"))
    .withColumn("col_b", F.expr("repeat(cast(rand() as string), 80)"))
    .withColumn("col_c", F.expr("repeat(cast(rand() as string), 80)"))
    .withColumn("col_d", F.expr("repeat(cast(rand() as string), 80)"))
    .withColumn("col_e", F.expr("repeat(cast(rand() as string), 80)"))
    .withColumn("metric", F.rand() * 1000)
)

result = (
    wide_data
    .repartition(20, "group_key")
    .groupBy("group_key")
    .agg(
        F.sum("metric").alias("total"),
        F.collect_list("col_a").alias("all_a"),  # forces large in-memory buffer
        F.count("*").alias("cnt")
    )
)
result.write.format("noop").mode("overwrite").save()

Stages: large shuffle volume, few tasks

Scenario 2 Stages. Stage 8: 20 tasks, 173.5 MiB shuffle read, 12 seconds. Compare to Scenario 1's Stage 2: 200 tasks, 7 KiB shuffle read, 9 seconds. The absolute shuffle volume flags the problem before you even click in.

The 173.5 MiB total shuffle read across 20 tasks means each task processes approximately 8.7 MiB of serialized data, before accounting for the deserialized in-memory representation which is larger. This is where the memory constraint originates.

Task Metrics: GC time is the signal

Stage 8 Task Metrics. Duration: Median 2s, Max 9s (4.5x ratio, much tighter than skew's 24x). GC Time: Max 4s on a 9s task, 44% of task time in garbage collection. Shuffle Read Median is 8.3 MiB / 200k records per task, confirming large uniform per-task data volumes.

Two things separate this from skew:

Duration ratio: Max/Median is 4.5x, compared to 24x in Scenario 1. All the heavy tasks are slow together, not one outlier dragging the stage.
GC Time: Max 4 seconds on a 9-second task. That is 44% of task time in garbage collection. The JVM is constantly reclaiming the large string arrays built by collect_list. This is the clearest memory pressure indicator available in the Spark UI.

Note there is no Spill row in this screenshot. On this cluster with 10.7 GiB executor memory, the data stays in heap but causes severe GC pressure. In production with smaller executor memory relative to partition size, the same root cause produces disk spill instead. The fix is identical in both cases: reduce per-task data volume.

Executors tab note: on a multi-executor cluster, check the Executors tab. The GC Time column shows cumulative GC per executor. Uniformly high GC across all executors points to a partition sizing problem. GC concentrated on specific executors points to uneven data distribution.

Scenario 2b: Partition fix

The fix raises shuffle partitions from 20 to 400 and removes the collect_list aggregation. Each task now receives a much smaller data chunk, and no large in-memory buffers are built.

spark.conf.set("spark.sql.shuffle.partitions", "400")

result = (
    wide_data
    .repartition(400, "group_key")
    .groupBy("group_key")
    .agg(F.sum("metric").alias("total"), F.count("*").alias("cnt"))
    # collect_list removed: no large in-memory accumulator
)
result.write.format("noop").mode("overwrite").save()

Scenario 2b Stages. Stage 10: 400 tasks, 39.0 MiB shuffle read (down from 173.5 MiB), 6 seconds (down from 12 seconds).

Scenario 2b Task Metrics. GC Time: 0ms across all percentiles including Max. Duration: Median 10ms, Max 0.2s. The root cause is eliminated, not just reduced.

Metric	20 partitions	400 partitions
Stage duration	12s	6s
Median task duration	2s	10ms
Max task duration	9s	0.2s
Max GC Time	4s (44% of task)	0ms
Total shuffle read	173.5 MiB	39.0 MiB

GC time dropping to zero is the validation signal. Not the wall-clock improvement (which is real), but the fact that the JVM has nothing to collect. The objects that were triggering pressure no longer exist at that size in heap.

Scenario 3: Underparallelism

The lab job runs a 3 million-row aggregation after repartitioning to 4 partitions. There is no skew and no spill. The cluster is simply not given enough tasks to use its available cores.

spark.conf.set("spark.sql.shuffle.partitions", "4")  # intentionally too low

under_data = (
    spark.range(3_000_000)
    .withColumn("category", (F.rand() * 100).cast("int").cast("string"))
    .withColumn("value", F.rand() * 500)
)

result = (
    under_data
    .repartition(4)  # 4 tasks regardless of cluster size
    .groupBy("category")
    .agg(F.avg("value").alias("avg_val"), F.count("*").alias("cnt"))
    .orderBy("cnt", ascending=False)
)
result.write.format("noop").mode("overwrite").save()

# Fix: raise partitions to match data volume
spark.conf.set("spark.sql.shuffle.partitions", "100")
result = (
    under_data.repartition(100)
    .groupBy("category")
    .agg(F.avg("value").alias("avg_val"), F.count("*").alias("cnt"))
    .orderBy("cnt", ascending=False)
)
result.write.format("noop").mode("overwrite").save()

Scenario 3 Stages. Three stages, all with 4/4 tasks. On a production cluster with 32 or 64 cores, the same configuration leaves the vast majority of the cluster idle throughout the job.

The underparallelism tell in the Stages tab is the task count. If every stage runs a small, fixed number of tasks regardless of data volume, the shuffle partition count is almost certainly the constraint. Check spark.sql.shuffle.partitions and compare it to 2x your executor core count as a starting floor.

Unlike skew and memory pressure, underparallelism produces no spill, no GC pressure, and a tight Max/Median ratio. All tasks run cleanly. The job simply uses a fraction of available parallelism, so it takes proportionally longer than it needs to.

Scenario 3b Stages. 100 tasks in the aggregation stage, 94 in the sort stage. On a production cluster with 32 cores, the 4-task version wastes 28 cores per wave. The 100-task version keeps the cluster busy and completes in roughly 1/8 the wall-clock time.

Decision map and practical thresholds

Symptom	Root cause	First action
Max/Median > 5x, shuffle read skewed	Data skew	Broadcast join if small side fits; salting if not
Spill on most tasks or GC > 10%	Memory pressure	More partitions before adding executor memory
Task count < 2x executor cores	Underparallelism	Raise `spark.sql.shuffle.partitions` or add `repartition()`
None of the above	Plan problem	SQL tab: check for cross joins, missing predicate pushdown, wrong join strategy

Thresholds experienced teams use

Skew threshold: any task reading more than 3x the median shuffle bytes warrants investigation. AQE's skew join optimization fires at 256MB by default. Lowering it to 64MB catches problems earlier: spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=67108864
Spill threshold: any disk spill is worth addressing. Even 100MB of spill indicates the executor memory-to-partition-size ratio is wrong.
GC threshold: GC time above 10% of task time in the Executors tab means memory is a primary constraint. Above 20%, it is the dominant cause of slowness.
Broadcast threshold: spark.sql.autoBroadcastJoinThreshold defaults to 10MB. In practice, tables up to 500MB to 1GB broadcast safely on modern clusters if executor memory is adequate. Check the physical plan in the SQL tab to confirm which join strategy Spark chose.
Partition sizing: target 128 to 256MB of input data per partition for shuffle-heavy stages. For compute-intensive stages with complex UDFs, target 64MB to avoid GC pressure from large in-memory objects.

Enable AQE first: set spark.sql.adaptive.enabled=true before doing any manual tuning. AQE resolves the majority of shuffle partition count and broadcast join problems automatically. Use the investigation sequence above for what AQE does not fix: severe skew on low-cardinality keys, memory pressure from large individual task sizes, and first-stage underparallelism before AQE has seen statistics.

Validating the fix

A single faster run is not enough. Here is what to check after applying a fix:

The metric moved, not just the clock. If you salted for skew, confirm the Max/Median ratio dropped below 2x. If you increased partitions for memory pressure, confirm GC time dropped to zero or near zero, not just reduced. Wall-clock improvement with no change in the underlying metric means something else is limiting performance.
Spill is zero, not reduced. Spill dropping from 10GB to 2GB means you improved the memory-to-partition ratio but did not fully resolve it. The correct partition size produces zero spill.
Run on a cold cache. The second run of a job often benefits from cached shuffle files from the first run. Force a fresh run by clearing cache or changing the input path, otherwise the improvement may be partially artificial.
Run at full production data volume. Skew and memory pressure are data-volume-dependent. A fix that works on a 10% sample frequently fails at full volume because the skewed key's frequency is nonlinear and per-partition data volume changes.
Check across a time window. For recurring jobs, pull 7 days of run history after applying the fix and confirm duration variance dropped. A job that averages fast but spikes 3x on weekends still has a skew or partition imbalance problem the average obscures.

Hope this is useful for diagnosing your slow stages. If you have additional techniques or thresholds that work well in your environment, please share them in the comments. These patterns improve with more data points.

DataFlint on Databricks - the Open Source Spark UI Upgrade Apache Spark Has Needed for Years

szymon_dybczak — Wed, 24 Jun 2026 08:21:16 GMT

Introduction

Apache Spark has become one of the most widely adopted engines for large-scale data processing. Its appeal is easy to understand: it supports batch processing, streaming workloads, feature engineering, machine learning pipelines, and large-scale analytical transformations across nearly every major data platform. It gives teams a powerful and flexible way to process data at massive scale.

But that power comes with complexity. Because Spark is a distributed computing engine, its behavior is not always easy to reason about when something stops working as expected. A simple symptom, such as a slow pipeline or a failed job, can be caused by many different things: skewed data, inefficient joins, shuffle bottlenecks, memory pressure, spilling, poor partitioning, executor failures, configuration issues, or an unexpected change in the physical plan.

The Spark UI contains a huge amount of useful information, but making sense of it is not straightforward. Relevant details are scattered across multiple tabs. Each tab gives you part of the picture, but rarely the full story.

The Spark UI itself also feels dated in some areas. When tracking a long-running query, you often need to refresh the page manually to see progress. To understand what is happening, you have to jump between multiple tabs and mentally connect information that is presented separately.
As a result, Spark debugging can become cognitively demanding very quickly. The hard part is not just finding metrics, but understanding which metrics matter, how they relate to each other, and what conclusion can be drawn from them. For many teams, getting from “this job is slow” to “this is the actual bottleneck” still requires a lot of experience, patience, and manual investigation

This is exactly the gap DataFlint aims to close.

What is DataFlint OSS?

DataFlint OSS is an open-source monitoring and debugging plugin for Apache Spark. It does not replace the Spark UI. Instead, it extends it by adding a dedicated DataFlint tab inside each Spark application.

The goal is simple: make Spark performance easier to understand. DataFlint does this by going beyond raw execution data. It highlights patterns that often indicate performance problems, such as data skew, inefficient resource usage, small files, large partitions, problematic joins, or suspicious executor behavior. These findings appear as alerts, helping you move from “something is slow” to a likely explanation much faster.

DataFlint also adds a more modern interface for exploring Spark applications, including run summaries, stage breakdowns, heat maps, syntax highlighting, and optional instrumentation for detailed operator-level timing. We will look at these features in the demo section, but the key idea is this: DataFlint makes it much easier to understand what happened in a Spark job and where to focus your attention.

How it works?

DataFlint is built on top of Apache Spark’s official plugin API. When a Spark application starts, Spark loads SparkDataflintPlugin through the normal plugin lifecycle. The plugin then creates a driver-side component, SparkDataflintDriverPlugin, and Spark calls its init() method during driver startup, before any jobs begin running.

This is important - DataFlint uses native extension mechanism Spark provides for instrumentation and runtime integrations.

The initialization flow has two main steps:

First, during init(), DataFlint can register a SQL extension called DataFlintInstrumentationExtension. This only happens when at least one instrumentation option is explicitly enabled. When enabled, the extension modifies Spark SQL physical plans by wrapping selected operators with timing nodes. These wrappers collect wall-clock duration metrics for parts of the query plan that the native Spark UI does not expose in the same level of detail.
Second, after init() completes, Spark calls registerMetrics(). At this point, DataFlint installs its Web UI tab, registers the REST endpoints used by the frontend, serves the bundled React single-page application, and attaches event listeners to Spark’s listener bus.

The result is a modern single-page application embedded directly into the existing Spark Web UI. It can poll live application metrics, update without full page refreshes, and run without requiring a separate service outside the Spark driver process.

Installation

There are two main ways to install DataFlint on Databricks:

Install it directly from a notebook (super easy):
https://dataflint.gitbook.io/dataflint-for-spark/getting-started/install-on-databricks#install-on-databricks-from-a-notebook
Install it as a Spark plugin on a Databricks cluster, which is the recommended approach.

Since the second option is recommended, let’s walk through that setup.

Before creating the init script, we need a location where the script can be stored. Databricks allows us to use either a Unity Catalog volume or workspace files for this purpose.

In this example, we will use a Unity Catalog volume for two reasons. First, Unity Catalog volumes are the recommended option when using DBR 13.3 LTS or later with Unity Catalog enabled. Second, workspace files are not supported on clusters running in standard access mode( formerly known as shared access mode).

To create a volume, we can use the following command:

%sql
CREATE VOLUME databricks_demo_ws.default.demo_volume;

Once the volume is ready, the next step is to prepare the init script. Open the DataFlint documentation and copy the init script that matches your Spark version. In my case, the cluster runs on Apache Spark 4.0, so I selected the corresponding DataFlint script.

Save the script content as init_script.sh, and upload it to the Unity Catalog volume.

One important thing to keep in mind is that standard access mode requires an administrator to add init scripts to the allowlist. Without this step, the cluster will fail to start and return an error similar to the following:

To handle this, open Unity Catalog Explorer and follow these steps:

In your Databricks workspace, click Catalog.
Click the gear icon.
Click the metastore name to open the metastore details and permissions page.
Select Allowed JARs/Init Scripts.
Click Add init script and provide the correct path to your script.

Once the init script has been added to the allowlist, go to your compute configuration. Open the Advanced section, then go to Init scripts. Choose Volumes as the source and provide the full path to the uploaded init_script.sh file.

Great, at this step we’re ready for installation. During my first attempt, the cluster failed to start because of a small typo in the init script that was available in the DataFlint documentation at the time.

After a short debugging session, I corrected the script locally and flagged the issue to the DataFlint team. They responded very quickly, and the documentation issue has already been fixed.

I’m leaving this note here for transparency, and also as a reminder that when working with init scripts, even a very small typo can prevent the cluster from starting.

Below is the corrected version you can copy and paste:

DATAFLINT_VERSION="0.9.9"
SPARK_DEFAULTS_FILE="/databricks/driver/conf/00-custom-spark-driver-defaults.conf"

mkdir -p /databricks/jars/

wget --quiet \
  -O /databricks/jars/dataflint_spark4-databricks_2.13-$DATAFLINT_VERSION.jar \
  https://repo1.maven.org/maven2/io/dataflint/dataflint-spark4-databricks_2.13/$DATAFLINT_VERSION/dataflint-spark4-databricks_2.13-$DATAFLINT_VERSION.jar

if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  mkdir -p /mnt/driver-daemon/jars/
  cp /databricks/jars/dataflint_spark4-databricks_2.13-$DATAFLINT_VERSION.jar /mnt/driver-daemon/jars/dataflint_spark4-databricks_2.13-$DATAFLINT_VERSION.jar
  echo "[driver] {" >> $SPARK_DEFAULTS_FILE
  echo "  spark.plugins = io.dataflint.spark.SparkDataflintPlugin" >> $SPARK_DEFAULTS_FILE
  echo "}" >> $SPARK_DEFAULTS_FILE
fi

If you author init_script.sh on a Windows machine, your editor will very likely save it with Windows (CRLF) line endings (\r\n). The Databricks driver runs Linux, where the shell expects Unix (LF) line endings (\n).
A script saved with CRLF will fail in confusing ways - the trailing carriage return gets attached to the last token on each line, so you'll see errors such as:

/bin/bash: line 2: $'\r': command not found

How to avoid it:

In VS Code, click the CRLF indicator in the bottom-right status bar and switch it to LF, then save.

This single issue comes up surprisingly often in Databricks Community threads, so I think it’s worth mentioning here.

DataFlint Demo: Seeing It in Action

Once the installation is complete, we can take a closer look at what DataFlint actually adds to the Spark debugging experience.

To make the walkthrough more concrete, let’s run a sample analytical query against the TPC-DS dataset and then open the DataFlint tab in the Spark UI:

SELECT
    w.w_warehouse_name,
    it.i_category,
    it.i_class,
    COUNT(DISTINCT i.inv_item_sk) AS distinct_items,
    COUNT(DISTINCT i.inv_date_sk) AS inventory_dates,
    COUNT(*) AS inventory_rows,
    SUM(i.inv_quantity_on_hand) AS total_quantity_on_hand,
    AVG(i.inv_quantity_on_hand) AS avg_quantity_on_hand,
    MIN(i.inv_quantity_on_hand) AS min_quantity_on_hand,
    MAX(i.inv_quantity_on_hand) AS max_quantity_on_hand
FROM samples.tpcds_sf1000.inventory i
JOIN samples.tpcds_sf1000.warehouse w
    ON i.inv_warehouse_sk = w.w_warehouse_sk
JOIN samples.tpcds_sf1000.item it
    ON i.inv_item_sk = it.i_item_sk
GROUP BY
    w.w_warehouse_name,
    it.i_category,
    it.i_class
ORDER BY
    total_quantity_on_hand DESC;

For the demo, we will follow a typical investigation flow: start with the high-level symptoms, check whether the cluster was used efficiently, inspect the SQL plan, let alerts point us to likely issues, and only then enable deeper instrumentation if we need more precise timing.

Start with the Summary page to understand the workload at a high level.
Check cluster resources to see how executors, memory, and cores behaved during the run.
Inspect the SQL plan to identify expensive operators and heavy parts of the query.
Use Alerts to jump directly to suspicious patterns instead of manually hunting through metrics.
Enable instrumentation when you need more precise operator-level timing.

Step 1: Start with the Summary Page

The Summary page gives a high-level view of how your query used cluster resources during execution. Instead of immediately jumping between Spark UI tabs such as Jobs, Stages, Executors, and SQL, DataFlint brings the most important performance indicators into one place.

This makes the Summary page a useful first checkpoint. It helps you quickly understand whether a Spark job was efficient, over-provisioned, memory-constrained, shuffle-heavy, or affected by spill operations.

At the top of the page, DataFlint shows metrics such as Duration, DCU, Input, Output, Memory Usage, Shuffle Read, Shuffle Write, Spill to Disk, Idle Cores, and Task Error Rate. Together, these metrics describe both the cost and behavior of the workload.

One metric that deserves a short explanation is DCU, which stands for DataFlint Compute Units. DCU is DataFlint’s measurement unit for Spark usage, similar in concept to a Databricks Unit, or DBU. It combines CPU and memory allocation into a single usage metric.

The formula is:

DCU = (Core/Hour usage * 0.05) + (GiB Memory/Hour usage * 0.005)

Core/Hour: is the number of cores allocated for your app in hours measurement.

GiB Memory/Hour: is the number of memory in GiB units allocated for your app in hours measurement

Another useful detail is that the Summary page refreshes automatically in real time. Unlike the standard Apache Spark Web UI, you do not need to manually refresh the page to see updated metrics while the application is running. This makes live monitoring much more comfortable, especially when you want to watch resource usage, memory pressure, shuffle volume, spill, or task failures as they happen.

Step 2: Check whether the cluster was used efficiently

Next, let’s look at the Resources page. This page gives a more focused view of how Spark resources are allocated and used during the job.

The Executors Timeline makes dynamic allocation easy to understand visually. In this run, Spark started with 1 executor, and later scales to 2 executors. The configuration table below the chart shows the executor and driver resources, such as cores and memory, together with dynamic allocation settings like minimum and maximum executors.
DataFlint puts the resource timeline and configuration details in one place, which makes it easier to connect workload behavior with cluster behavior.

Step 3: Inspect the SQL Plan

After checking the overall workload and cluster behavior, we can move from symptoms to execution details. The Summary page contains a list of executed and currently running queries. When you click one of them, DataFlint opens an interactive SQL execution plan graph.

For example, clicking query ID 10 opens a graph view of the query plan. The plan can be viewed in three modes:

I/O Only: input/output scan and write nodes.
Basic: the main transformations.
Advanced: every node in the plan.

Each node shows the operator name, such as Filter, Exchange, or FileScan, together with key metrics like output rows, shuffle bytes, spill size, partitions, and table name for scans.

A really useful feature is the performance heat bar at the top of each node. It uses green, orange, and red to show how much of the total query time was spent in that operator.

There is also a MiniMap in the lower-left corner that directly complements the heat map. While the main graph lets you zoom in on individual nodes, complex queries can have plans that are too large to view in full at once. The MiniMap gives you a bird’s-eye view of the entire plan, with the same heat map coloring applied, so red nodes remain visible even when they are off-screen.

Together, the heat map and MiniMap let you quickly locate the most expensive operator - and that’s super cool.

Some nodes can also display additional badges:

A green flag badge indicates that the node is instrumented by TimedExec and has precise wall-clock timing.
A rocket badge indicates that the operator is running on a native accelerator such as Gluten, Comet, or Photon.
An alert badge indicates that DataFlint detected a potential issue on that node.

Shuffle operations are especially nice in this view. DataFlint splits Exchange nodes into two separate half-nodes: shuffle write and shuffle read. Each side has its own metrics and stage association, which makes it much easier to see where shuffle cost is coming from.

The plan view also includes a stage grouping toggle. When enabled, nodes that run in the same Spark stage are visually enclosed in a stage container. Clicking a stage opens a side drawer with stage-level details.
This is another feature with no equivalent in Spark’s native UI. In Apache Spark, a query’s execution plan and its stages are two completely separate concepts - the SQL tab shows you the plan tree, and the Stages tab shows you a flat list of stages, with no visual connection between them. DataFlint bridges this gap by overlaying stage boundaries directly onto the plan graph.

Finally, there is a duration mode toggle. You can switch between exclusive duration, which means time spent in that operator only, and inclusive duration, which means time spent in the operator and all of its children.

The bottom toolbar contains useful navigation shortcuts:

Speed: cycles through nodes ordered by duration percentage, highest first.
Warning: cycles through nodes with alerts.
Storage: cycles through nodes with spill, ordered by spill size.
Fit view / zoom controls: help you navigate large plans.

Step 4: Let Alerts Point to the Problem

So far the Summary and Resources pages have helped us observe what a job did. The Alerts page is where DataFlint goes a step further and starts reasoning about it. Instead of leaving you to interpret the metrics yourself, it continuously inspects every query, stage, and executor against a set of built-in heuristics and surfaces the ones that look wrong - each one written up as a short, plain-English finding with a concrete suggestion on how to fix it.

This is probably the single biggest difference from the native Spark UI. The native UI will show you that one task in a stage ran for 28 seconds while the others finished in a second but it will never tell you “this is data skew, and here’s what to do about it.”

You need to know where to look, which numbers to compare, and what the comparison means. The Alerts page encodes that experience for you.

When you open the tab, alerts are grouped by type and tagged as either a warning (yellow) or an error (red), with a running count of each at the top. Every alert card has a “Go to alert” button that jumps straight to the exact SQL node, stage, or resource the finding is about, so you never have to hunt for where the problem lives.

A concrete example: data skew

The TPC-DS sample tables I’ve used so far are nicely balanced, so they will not show a strong skew problem on their own. To make the issue visible, we need to manufacture one. For the synthetic example, imagine a multi-tenant event platform where most tenants are small, but one enterprise tenant generates almost all of the traffic. This is something you can encounter in real systems: one customer, account, region, or tenant dominates the data distribution.

The setup is simple. Most tenants produce only a small number of events and have one routing rule. But `tenant_enterprise_001` produces around 97% of all events and has 100 routing rules. Then we run a reasonable-looking analytics query: join events to routing rules and summarize how many events each rule matched.

Before running the query, I also move a few Spark optimizations out of the way. I disable broadcast joins so Spark cannot simply broadcast the rules table and avoid the shuffle. I also disable Adaptive Query Execution so Spark’s automatic skew handling does not rescue the query before DataFlint has anything interesting to show.
In production, AQE is usually something you want enabled. For a demo like this, though, turning it off lets the skew surface clearly.

from pyspark.sql import functions as F

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")


N = 10_000_000

events_df = (
    spark.range(0, N, 1, numPartitions=128)
    .select(
        F.when(F.rand(seed=42) < 0.97, F.lit("tenant_enterprise_001"))
         .otherwise(
             F.concat(
                 F.lit("tenant_"),
                 F.lpad((F.col("id") % 20_000).cast("string"), 5, "0")
             )
         ).alias("tenant_id"),

        F.col("id").alias("event_id"),
        F.concat(F.lit("session_"), (F.col("id") % 2_000_000).cast("string")).alias("session_id"),
        F.sha2(F.col("id").cast("string"), 256).alias("payload")
    )
)

events_df.createOrReplaceTempView("demo_events")

normal_rules_df = (
    spark.range(1, 20_001, 1, numPartitions=32)
    .select(
        F.concat(
            F.lit("tenant_"),
            F.lpad(F.col("id").cast("string"), 5, "0")
        ).alias("tenant_id"),
        F.lit(1).alias("rule_id"),
        F.lit("standard_rule").alias("rule_type")
    )
)

enterprise_rules_df = (
    spark.range(1, 101, 1, numPartitions=4)
    .select(
        F.lit("tenant_enterprise_001").alias("tenant_id"),
        F.col("id").alias("rule_id"),
        F.concat(F.lit("enterprise_rule_"), F.col("id").cast("string")).alias("rule_type")
    )
)

rules_df = normal_rules_df.unionByName(enterprise_rules_df)

rules_df.createOrReplaceTempView("demo_routing_rules")


skew_query = """
SELECT
    r.rule_type,
    COUNT(*) AS matched_events,
    COUNT(DISTINCT e.session_id) AS unique_sessions,
    MIN(e.event_id) AS first_event_id,
    MAX(e.event_id) AS last_event_id
FROM demo_events e
JOIN demo_routing_rules r
  ON e.tenant_id = r.tenant_id
GROUP BY r.rule_type
ORDER BY matched_events DESC
"""

display(spark.sql(skew_query))

After the query finishes, open the Alerts page in DataFlint. You should see a Partition Skew warning for the join stage.

If you click Go to Alert, you will be redirected straight to the location in the physical plan where the problem occurs. This is a super nice feature.

It is worth to explain why DataFlint treats this as skew rather than normal variation. The alert is based on the task-duration distribution within a stage. DataFlint compares the slowest task with the median task and raises a warning only when the difference is large enough to matter. That avoids noisy alerts for tiny stages where one task being a bit slower is irrelevant.

In this example, the warning is expected because the workload is intentionally unbalanced. The median task processes a small tenant-sized partition, while the worst task processes the enterprise tenant and its 100-rule fan-out. That is exactly the scenario skew alerts are meant to make obvious.

Another common offender: small files

Skew is about time; the next example is about layout. The problem here is a table written as thousands of tiny files instead of a few large ones - when you read it back, Spark spends more effort opening files and scheduling tasks than actually processing data. We can reproduce it intentionally by spreading a small amount of data across far too many output files:

-- Force 5,000 output files for a tiny dataset -> a few hundred rows
-- (a few KB) per file.
CREATE TABLE default.tiny_files AS
SELECT /*+ REPARTITION(5000) */ id, rand() AS v
FROM range(0, 1000000);

A million rows split across 5,000 files is roughly 200 rows - a few kilobytes per file. Simply reading the table back triggers the Reading Small Files warning:

SELECT SUM(v)
FROM default.tiny_files;

The heuristic here is simple: DataFlint divides the bytes read by the number of files read for a scan, and if the average file is smaller than a few megabytes and the scan touched more than a hundred files, it flags it. There’s a matching alert on the write side too - if your job is the one producing the small files, DataFlint will point that out and even tailor the advice to whether the output is partitioned.

Step 5: Experimental feature: DataFlint Spark Instrumentation

DataFlint provides optional instrumentation that enhances Spark observability. It injects extra metrics and metadata into the Spark UI that Spark does not expose on its own. All instrumentation is opt-in and disabled by default, so nothing about your query planning changes unless you explicitly turn it on.

The mechanism is worth understanding before we use it. When any instrumentation flag is enabled, DataFlint registers a Spark SQL extension during driver startup. That extension hooks into Spark’s physical planning phase and wraps selected operators with a lightweight timing node. The wrapper is transparent - it shows up as a single node in the plan graph (its name simply gets a DataFlint prefix), it keeps all of the operator's original metrics, and it adds one new metric: duration, the wall-clock time that operator actually spent doing work.

Instrumentation is split into granular flags so you can enable just what you need. The two we’ll look at here are Window instrumentation and SQL nodes instrumentation:

# enable only window timing
.config("spark.dataflint.instrument.spark.window.enabled", "true")

# enable timing for the common SQL operators (filters, joins, scans, aggregates, ...)
.config("spark.dataflint.instrument.spark.sqlNodes.enabled", "true")

# or turn everything on at once
.config("spark.dataflint.instrument.spark.enabled", "true")

On Databricks you’d add the same keys to your cluster’s Spark config.

Window instrumentation

Window instrumentation wraps Spark’s WindowExec so you can see how long the window computation took, right on the plan node.

Let’s try it with what below dummy query:

SELECT
    c_customer_sk,
    c_last_name
FROM samples.tpcds_sf1.customer
QUALIFY row_number() OVER (PARTITION BY c_last_name ORDER BY c_customer_sk DESC) = 1;

We enable spark.dataflint.instrument.spark.window.enabled, run the query, open the SQL plan... and, weirdly, there is no window operator to be found. What happened?

After some investigation it turns out this query never uses a plain WindowExec operator at all. Because the window is immediately filtered by QUALIFY, Spark applies an optimization introduced in [SPARK-37099] - a dedicated physical operator called WindowGroupLimit.

The idea behind SPARK-37099 is as follows: for rank-style functions such as row_number, rank, and dense_rank, the rank of a key computed on a partial dataset is always less than or equal to its final rank over the full dataset. This means Spark can safely discard rows whose partial rank already exceeds k, before the expensive shuffle and window processing take place. To do this, Spark inserts a per-window-group limit both before and after the shuffle.

As a result, the window execution time we intended to measure is now captured inside WindowGroupLimit, which is not covered by the window flag.

Let’s remove the filtering from our query.

SELECT
    c_customer_sk,
    c_last_name,
    row_number() OVER (PARTITION BY c_last_name ORDER BY c_customer_sk DESC)
FROM samples.tpcds_sf1.customer

After that, we should be able to see the window instrumentation. Great - we’re learning DataFlint and Spark internals at the same time! And as you can see -> Window instrumentation worked this time.

And good news: I asked the DataFlint team about WindowGroupLimitExec support, and they confirmed that it is already planned for the next release. So soon we should have instrumentation for this operator as well - nice.

SQL nodes instrumentation

SQL nodes instrumentation casts a much wider net. Instead of a single operator family, it wraps the common physical operators that make up most query plans - filters, projections, joins, sorts, hash/sort aggregates, the file and batch scans, the write command.

When enabled, DataFlint’s DataFlintInstrumentationExtension wraps each SQL physical operator with a TimedExec node that measures actual wall-clock execution time per operator. The result is a duration metric on every
instrumented node - not an estimate derived from task metrics, but a direct measurement of how long that specific operator spent processing data. This is what powers the heat map: without instrumentation, duration percentages are approximated from stage-level data; with instrumentation, every node carries its own precise timing.

On the screen below, you can see that after enabling SQL node instrumentation, wrapped operators are visible even in the native UI.

The practical impact is significant. Consider a stage containing a SortMergeJoin followed by a Filter followed by a Project. At the stage level, they all look the same — part of a 3-minute stage. With instrumentation, you might discover the join consumed 2 minutes 50 seconds, and the filter ran in under a second. That distinction is the difference between tuning the right thing and tuning the wrong thing.

Instrumentation is intentionally opt-in - it rewrites the physical query plan, which carries a small overhead and a compatibility risk with native accelerators like Gluten or Comet. But for workloads where you need to understand performance at operator granularity rather than stage granularity, it transforms the plan view from a structural diagram into a genuine performance profile.

Conclusion

DataFlint is an excellent addition to the Spark ecosystem. It makes day-to-day debugging much easier by bringing the most important pieces of information into one place, highlighting suspicious patterns, and helping you move from “something is slow” to “this is probably why” much faster.

I have already started using DataFlint in my daily workflow, and it has made Spark performance investigation feel much less painful. If you work with Spark regularly, I definitely recommend giving it a try.

You can check out the project on GitHub, and if you find it useful, consider giving the repository a star. It’s a simple way to support the project and help more Spark users discover it.

Also, if you are interested in Spark optimization more broadly, I highly recommend checking out the DataFlint YouTube channel and the Big Data Performance Substack, which is run by one of DataFlint’s founders. Both are great resources for learning more about Spark performance, debugging, and optimization.

And finally, if you want to help improve the Spark debugging experience for everyone, consider contributing to the project.

A small song for the Databricks Community

Brahmareddy — Tue, 23 Jun 2026 17:55:42 GMT

One Question Can Light the Spark

My Databricks journey started in 2022 with simple interest, curiosity, and a dream to learn more.

At that time, I was just trying to understand the platform, follow the updates, learn from others, and slowly build my confidence. Like many learners, I had questions, doubts, and moments where I did not know where to start.

Over time, the Databricks Community became more than a place to read posts. It became a place to learn, ask, share, and grow.

Every question taught me something.
Every answer gave me direction.
Every discussion opened a new idea.
Every contributor inspired me to give back.

From that first interest in 2022 to attending the Data + AI Summit in 2026, this journey has been very special to me. Seeing the Databricks Community in action, meeting people, learning from leaders, and feeling the energy of this ecosystem made me even more grateful.

That is why I created this song for the Databricks Community.

I created it with love, respect, and gratitude for every visitor, learner, contributor, champion, and leader who makes this community meaningful.

Many people come here with a simple question. Some come with errors. Some come with ideas. Some come with big dreams. And this community gives them a place to start.

One helpful answer can give someone confidence.
One shared experience can save someone hours.
One kind reply can make someone feel welcome.
One contribution can inspire many more.

That is the real beauty of the Databricks Community.

Visitors become learners.
Learners become contributors.
Contributors become champions.
Champions inspire the next generation.
And community leaders continue to build a space where everyone can feel included.

I truly love Databricks and this community. I would love to keep helping in the best way I can, by learning, sharing, supporting others, and inspiring more people to participate.

This song is my small tribute to all of us.

One question can light the spark.
One answer can guide the heart.
Together, we learn.
Together, we share.
Together, we grow.

Thank you, Databricks Community.

Databricks CustomerLake

ShivamKumar7788 — Mon, 22 Jun 2026 11:28:42 GMT

Databricks CustomerLake: Inside the Agentic CDP Built for the Age of AI

A deep dive into what CustomerLake actually is, how it works, and what it looks like in practice.

At Data + AI Summit 2026, Databricks announced CustomerLake — an Agentic Customer Data Platform built natively inside the Databricks Lakehouse.

Not a standalone tool. Not a separate layer bolted on top. The thinking behind it is simple: rather than pulling customer data out into yet another platform, bring the CDP capabilities directly into the environment where that data already lives — with governance, security, and data infrastructure already in place.

This post walks through what CustomerLake covers, how it works, and what the product actually looks like — drawing from the official keynote, product demo, and launch materials.

The Problem CustomerLake Is Solving

Marketing at most enterprises still follows a familiar sequence. A plan gets defined. Data teams pull together what is needed. Audiences are assembled. A campaign gets configured in some automation tool, pushed out, and then measured. Rinse and repeat.

The cycle has worked well enough — but it moves slowly. Building and refining campaigns typically runs across weeks, sometimes months. And the output, despite all the effort, tends to be broad. The same message going out to large groups, not genuinely tailored to individual customers.

Meanwhile, the buying side is evolving fast. AI agents are now doing research, comparing options, and making purchases on behalf of consumers — always available, reacting to new context almost immediately, operating across a growing number of channels. Marketing built around weekly batch cycles does not keep pace with that.

The Concept: Infinity Campaigns

CustomerLake is built around a core idea Databricks calls Infinity Campaigns.

Traditional campaigns are time-boxed — they start, run to a predefined audience, and end. Infinity Campaigns work differently. They are continuous engagement loops, always running, with no fixed end state. Every customer gets evaluated individually in real time, which means the one-to-many model gives way to something closer to true one-to-one engagement.

The underlying logic: customer actions and signals get picked up by enterprise-side agents, processed against the customer's full profile and context, and a decision gets made — does this person need to hear from us right now, and if so, what and through which channel? When an action is taken, that interaction becomes a new signal, feeding back into the same loop.

Evergreen. Always adapting. No campaign relaunch required.

What Makes a CDP "Agentic"

CDPs have historically served three core functions: building a unified customer profile, enabling marketers to define audience segments, and pushing those segments out to execution tools like email or mobile platforms.

The limitation has always been architectural. CDPs lived outside the core data platform. They needed their own copy of the customer data, maintained their own governance layer, and required ongoing data movement to stay current.

For an agentic approach to work, that architecture breaks down. Agents need access to everything — customer history, behavioral context, business rules, predictive models, campaign performance — all in one place, without data movement introducing lag or gaps.

Databricks built CustomerLake around three requirements for what an agentic CDP needs to be:

Embedded in the lakehouse — customer data, context, and agents share the same infrastructure. No copies, no sync jobs, no reconciliation between systems.

Built around agents as the core operating model — not a conventional platform with an AI feature added. The agent is how data gets prepared, how audiences get shaped, how campaigns get planned, and how decisions get made per customer.

Capable of true one-to-one personalization at scale — not segments of thousands, but individual decisions made continuously for every customer in the system.

The Architecture

CustomerLake has two main components: Profile Agents and Campaign Agents.

Raw customer data flows in, gets processed by Profile Agents into clean unified profiles, and those profiles become the foundation for Campaign Agents to run Infinity Campaigns. A built-in Reverse ETL layer handles pushing decisions and audience data back out to the execution tools that reach customers — email platforms, ad networks, SMS, in-app messaging, and more.

Data sources include anything already sitting in the Databricks Lakehouse, plus external data from MarTech and CRM systems brought in through Lakeflow Connect. Unity Catalog handles governance across the whole stack — the same controls that apply to the rest of the data estate apply here too.

Profile Agents: Building the Customer 360

Getting to a reliable, unified customer profile is foundational to everything else. Profile Agents handle the full pipeline to get there.

Data Preparation

Lakeflow Connect brings in external data — from CRM platforms, MarTech tools, and third-party sources — alongside whatever is already in the lakehouse. Once a new dataset lands, Genie (Databricks' AI layer) takes over the preparation work.

It reads the dataset, identifies what each column actually represents — email, phone number, full name, address — and applies semantic tags accordingly. It then generates normalization rules to clean and standardize the data automatically, handling inconsistencies and filtering out invalid values without any manual mapping.

Third-party data enrichment is accessible through a Data & Identity Marketplace — providers can be connected and their data pulled in with a single click.

Identity Resolution

Matching records across different data sources — recognizing that two entries with slightly different details actually represent the same person — has always been one of the harder problems in customer data work.

CustomerLake handles this through what Databricks calls Agentic Identity Resolution, which runs across three stages:

Rules-based matching covers the straightforward cases — exact matches on unique IDs, normalized email addresses, or combinations like phone number with a fuzzy name match. The rules are readable and configurable.

LLM review handles the middle ground — cases where the rules do not reach a confident conclusion. A language model steps in to assess whether two profiles are likely the same person.

Human review is reserved for the genuinely uncertain — a queue where a person makes the final determination.

What ties this together is a feedback loop. Every decision made at the LLM and human stages gets incorporated back into the rules layer, so each run of the identity resolution process is more accurate than the last. Organizations can also bring their own ML models into the pipeline if they already have them.

When a new data source is added, Genie automatically analyzes it against existing matching rules and recommends additional rules where gaps or opportunities are identified — explaining the reasoning behind each suggestion and previewing the expected impact before anything is applied.

Gold Customer Table

The end product of Profile Agents is a Gold Customer Table — a single governed schema that every data source maps into. Where sources disagree on a field value, survivorship rules decide which one wins. The whole thing is configurable through a UI or YAML, so both technical and non-technical team members can work with it.

Campaign Agents: From Goal to Individual Decision

With a clean, unified customer profile in place, Campaign Agents take over — translating business goals into personalized, continuously running campaigns.

Building Audiences

Audience creation works through natural language. A marketer describes the audience they need, and Genie builds the segment directly against live lakehouse data. No SQL. No hand-off to a data analyst.

A marketer describes the audience they need in plain language, and Genie builds the segment directly against live lakehouse data — no SQL, no hand-off to a data analyst. Existing audiences can be refined further the same way, by simply describing the additional conditions needed. Genie converts the description into precise data filters and updates the segment instantly.

Audience insights — size trend over time, purchase category breakdown, average spend, churn risk — are surfaced automatically. Suppression rules reference live data conditions rather than point-in-time exports, so someone who converts mid-campaign is removed from eligibility immediately, not at the next scheduled refresh.

Campaign Planning

Turning an audience and a goal into a campaign starts with a brief conversation. Genie asks a focused set of questions — which channels to use, how many messages to send, when the campaign should conclude — and uses the answers to generate a structured campaign brief.

The brief covers the full picture: goals and success metrics, a sequenced messaging plan with rationale per touchpoint, timing and cadence, guardrails (frequency limits, opt-out lists, suppression of customers with open support tickets), personalization signals to draw on, and the assumptions behind the plan.

This document becomes the foundation the campaign is built from. It is fully editable before anything gets built.

Decisioning and Reasoning

Before going live, Campaign Agents can run a pre-launch simulation across a sample of real qualified profiles. The simulation shows what the agent would actually do for each person — which message they would receive, whether they would be deferred based on existing campaign load — without triggering any actual sends.

Each profile in the simulation comes with a Reasoning panel: a plain-language explanation of why that specific message was chosen, which rule it matched, and why the send timing was set the way it was. The agent also accounts for campaigns running in parallel — if a customer is already receiving heavy outreach from another active campaign, that factors into the decision before anything goes out.

This kind of per-profile transparency, available before launch rather than after a complaint, changes how marketers can review and trust the decisioning layer.

Performance and Activation

Once a campaign is live, Campaign Agents monitor it continuously — flagging performance trends and suggesting adjustments in real time. Native A/B testing makes variant comparison straightforward across the key engagement metrics.

Activation runs through Reverse ETL — bi-directional connections to the MarTech and AdTech tools already in use, covering email, SMS, in-app, and advertising platforms.

Early Customers and Partners

CustomerLake has been in private rollout with select enterprise customers ahead of the public announcement. Early customers include GM, AB InBev, HP, Circle K, Barclays, and Getnet.

The platform launches with an open partner ecosystem spanning identity, activation, measurement, and customer experience — alongside implementation partners supporting deployment at enterprise scale.

Where Things Stand

CustomerLake is currently available in Private Preview. Organizations interested in early access should reach out to their Databricks account team.

The product makes the most sense for teams whose data foundation is already on Databricks — the value comes from not having to replicate that foundation elsewhere to support marketing use cases. If the data is already there, the Customer 360, the audiences, and the campaign intelligence can be built on top of it directly, under the same governance that covers the rest of the data estate.

Sources: Introducing CustomerLake: The Agentic CDP embedded in Databricks — Databricks Blog Introducing Databricks CustomerLake — Official YouTube

Medallion Architecture Has 3 Layers. We Built 5. Here's Why — Views Layer Design on Databricks

savlahanish27 — Mon, 22 Jun 2026 07:40:25 GMT

Part 4 of my enterprise data platform series is up - this one cover why we added a fifth layer to the standard medallion architecture.

We connected BI tools to the Gold layer and immediately hit four problems Gold alone couldn't solve:

Schema breaks when we renamed a Gold column (three Tableau reports broke immediately)
Three workbooks calculating vendor aging differently, none of them agreeing
The same vendor_master join running independently across four dashboards
Row-level filtering that we didn't want duplicated in every downstream tool

All four solutions pointed to the same thing - a Views layer between Gold and consumers.

What's in the post:

Schema stability via views - one update instead of fixing every downstream query
Business logic abstraction - vendor aging buckets defined once, consumed everywhere
Row-level security with dynamic views using current_user() and Unity Catalog
Pre-joined views for heavy consumers with Databricks Dashboard query cache
View naming convention, Git-based version control for view definitions
What I'd do differently - designing Views before Gold, not after

Full post on Medium: https://medium.com/@savlahanish/medallion-architecture-has-3-layers-we-built-5-heres-why-41408c71c6b7

Part 5 is where it gets messier - Tableau and Databricks Dashboards behaving differently against the same views, a decimal precision issue that cost two hours, and what happens when interactive queries and Tableau batch refreshes hit the same SQL Warehouse at 9am.

Happy to answer questions on any of the decisions - particularly around the row-level security pattern or the naming convention.

Metric Views with Power BI and Tabular Editor (Part 3 of 3)

KrisJohannesen — Mon, 22 Jun 2026 07:37:32 GMT

This is part 3 of 3 in a series where I take you through working with Metric Views.

Part 1: Introduction to Metric Views
Part 2: Metric Views and the Databricks platform (AI/BI Dashboards, Genie, etc.)
Part 3: Metric Views with Power BI and Tabular Editor

Not all semantic models are created equally

While a semantic model is a general concept that can be applied across a lot of different systems, I think it is no coincidence that Microsoft uses this exact term in Power BI to describe their models that are underpinning their reporting capabilities. As a result of this, people might associate the Semantic Model directly with Power BI for this reason, but as described in Part 1, we consider Metric Views to be a Semantic Model of its own as well.

In order to understand a bit about how and why things work the way they do, I think it might be a good idea to highlight some core concepts of Semantic Models, and how they work a little different between Databricks and Power BI.

Different models, same purpose

In my mind, a Semantic Model in Power BI are synonymous with a star-schema model. This is mainly due to the fact that the Power BI engine is designed around the star-schema, which means that it evaluates queries faster when the model is designed like this. When loaded into the semantic model, a star schema preserves each of the tables on their own.

In contrast to the star schema, a Databricks Metric View resembles more of a One-Big-Table style of semantic model. So even though we have facts and dimensions stored in separate tables, the Metric View itself is defined as a single view. This does introduce some issues, such as what to do in the case of a multi-fact model, or how to solve different granularities such as time. That being said, I do expect the Metric View to evolve to possibly handle some of these cases in the future.

For more details on the different types of Semantic Models, and their pros/cons I recommend this article by Kurt Buhler, written for Tabular Editor. This is centered around Power BI, but the general concepts are applicable across different BI tools.

Data model types, examples, and tips for Power BI

Loading Metric Views in Power BI - Natively

Databricks Metric Views provide a unified semantic layer that can be queried via SQL. We have dimensions, measures and even some nice metadata descriptions, type hints and similar resources defined. Therefore, you might have the same idea as me.

We should be able to use the Metric View as the source for a Power BI report directly

With my complete naive and optimistic approach, I went ahead hoping that at least some of the model could be loaded and used as it was defined in Databricks. Well, unfortunately that was not at all the case. In fact, if you try to setup a Power BI model querying a Metric View using the native Databricks connector, pointing directly as your Metric View, you will not even be able to get anywhere. You will instead be met with the below error message.

In short, the above error tells you that in order to query a Metric View in SQL, you need to use the MEASURE() reference in each of the defined measures. Honestly, this makes sense, since it is the same approach you need when querying the Metric View inside of Databricks' own platform.

Here, I have added a short video shared by Databricks, where Simon Whiteley demonstrates exactly how a SQL Query of Metric Views actually works in practice.

Alternative approaches and workarounds

Alright, so since there is no native support, and it unfortunately does not exist on any roadmaps yet, I was wondering if we can do something else to actually leverage the definitions we have already created. The obvious solution is of course to lean into the Databricks native AI/BI dashboards as described in Part 2, but I also know that some companies are so heavily invested in Power BI that this would not be a viable approach, at least not in the short-term.

So I set out to test some workarounds. I am not aiming for this to be the perfect solution, so to be honest, I have fairly low expectations of anything actually working dynamically in terms of measure evaluation. I know that this is honestly not very useful when comparing to a DAX measure that is automatically evaluated at the granularity reflected, but let's see if we can work some magic.

Load to Power BI using SQL Query

While this might not be the most elegant of solutions, and you do end up losing quite a lot of the underlying logic of the Metric View itself, you do have the possibility of loading the Metric View using a SQL Query. This can be done in one of two ways, both of which are not that elegant:

Create a View on top of your Metric View (I know - this sounds dumb). In the view, the measures of the Metric View need to be referenced as MEASURE(measure 1) as 'measure 1' while the dimensions can be referenced directly by name.

Create a SQL query against Databricks directly. When setting up your Databricks connection, use a SQL query and write up the same definition as you would for the above view. Same concept, but you do not get an additional object inside of Databricks - instead you get a SQL query that lives inside your M partition.

Both of these approaches allows you to load the columns correctly to Power BI, however you are essentially left with a regular view, with no annotations, metadata or measure definitions.

None of these approaches actually achieve any benefit in having the Metric View. They would both work just as well by just using a regular View from the beginning - or even better - by loading the fact and dimensions from tables/views separately and defining your model inside of Tabular Editor (or Power BI Desktop). With the correct metadata in Unity Catalog, you would even get the Relationships defined automatically. Check out this article (and the rest of the series), written by 💀 Johnny Winter 💀 for some great tips and tricks.

Tabular Editors Semantic Bridge:

In November 2025, Greg Baldini from Tabular Editor went on the Explicit Measures Podcast to discuss their new MVP feature called the Semantic Bridge that they are working on alongside Advancing Analytics. The Semantic Bridge introduction and discussions starts at around minute 22:00 - but honestly the whole thing is really interesting.

This feature is meant to work across multiple semantic layers, and might be the missing link that can actually enable us to work with Metric Views in Power BI. While this is still in its early stages, I found the discussions around how to work across different formats and syntax really interesting and I can't wait to see what we can do with this feature in the future.

There will be a Bonus Episode to this series with more details on the Semantic Bridge specifically!

Conclusions

No Native Integration (Yet): Power BI does not natively support Databricks Metric Views. While the Databricks connector allows listing them from Unity Catalog, it cannot query them directly due to the required use of MEASURE() and GROUP BY logic that Power BI does not generate automatically. At this stage, Metric Views are most useful within the Databricks environment itself. Native AI/BI dashboards, and tools like Genie natively understand and correctly interpret Metric Views with no workarounds

SQL Workarounds Require Redefinition: The only functional workaround today is to write custom SQL queries in Databricks that explicitly call MEASURE() and expose the result as a view or table. These can then be imported into Power BI. However, this approach redefines the logic of the Metric View, weakening the promise of a single source of truth and also requires you to re-define your measures as DAX on the back of the integration.

Tabular Editor & Semantic Bridge: This might be the short term solution to cross-integration between the two platforms that allows us to translate one to the other and vice-versa, however it is still not a direct connection.

The Community is asking for more: There is growing demand for a native Power BI integration. Want to support it yourself, then check out this Microsoft Fabric community idea.

If you are really deep into Databricks already, it might be worth considering if the time has come to move your Analytics and Dashboards from Power BI into Databricks. But that's a topic for another day!

Getting Certified as a Databricks Generative AI Engineer Associate: Key Takeaways and Insights

AngelShrestha — Mon, 22 Jun 2026 04:55:05 GMT

I just earned my Databricks Certified Generative AI Engineer Associate Certification, and in this post, I’m sharing the key tips, resources, and including what confused me, what actually worked, and the traps I nearly fell into.

Why I Took This Exam

I work across building scalable ML and Gen AI solutions and architecture, which means staying current on the GenAI stack is a practical requirement, not just a resume item. While working on a recent project, I started exploring Databricks more deeply, and I found a platform that have evolved from data engineering into a serious end-to-end system for building production AI applications, from data ingestion all the way to agents, monitoring, and governance.

I'm sharing this not as a polished success story, but as an honest account of the preparation process; including the topics that genuinely confused me, and what actually helped. I hope it's useful whether you're just starting to explore the platform or actively preparing for the exam.

About the Exam

The Databricks Certified Generative AI Engineer Associate tests the full lifecycle of building GenAI applications on Databricks; from design and data preparation through to deployment and monitoring. Approximately 56 multiple-choice questions in 90 minutes, including some unscored questions.

Domain Breakdown

Domain	Weight	Focus Area
Design Applications	14%	Prompt design, model selection
Data Preparation	14%	Chunking, embeddings, vector search
Application Development	30% (heaviest)	Agent tools, frameworks, deployment patterns
Assembling & Deploying Apps	22%	MLflow, Model Serving, CI/CD, Apps
Governance	8%	Unity Catalog, access control, lineage
Evaluation & Monitoring	12%	MLflow judges, monitoring pipelines

For a detailed overview, access the complete exam guide.

What Actually Helped Me Prepare

1. The Four Official ILT Courses

I completed the instructor-led track end-to-end. All four. In order. These are well-structured and having a live instructor to ask questions made a real difference when concepts felt confusing.

Building Retrieval Agents on Databricks — RAG pipelines, embeddings, Vector Search, chunking strategies, MLflow tracing for agents
Building Single-Agent Applications — UC function tools, LangChain integration, ResponsesAgent, MLflow logging and reproducibility, Agent Bricks
Generative AI Application Evaluation and Governance — MLflow judges (built-in, guideline, custom), offline vs online evaluation, the Review App, human feedback loops
Generative AI Deployment and Monitoring — Batch vs real-time deployment, Lakehouse Monitoring, LLMOps vs MLOps, Databricks Asset Bundles

The courses provide a strong mental model for building and operating GenAI applications, and the hands-on labs reinforce the concepts as you learn them.
You can explore and register for these courses through the Databricks Training Catalog: Databricks Training Catalog. Some courses are free, while others are paid.

2. Demo notebooks and Labs

I also went through the hands-on demos and labs for each module. This will help you gain practical knowledge of concepts on Databricks .
Note: The self-paced courses are free to access, but the demo/lab notebooks require an annual subscription.

3. Going Deep on the Official Documentation

After completing the courses, I spent time going through the documentation for each topic they covered. The docs are the most reliable source for exam-specific details and help fill in many of the gaps that the courses only touch on at a high level.

I highly recommend reading everything in the Databricks Agents documentation: Databricks Agents Documentation. It covers a large portion of the theoretical knowledge that is in depth for the concepts in the training courses.

4. A Decision-Table Revision System (with AI)

This was one of the most effective things I did. I used AI, specifically Claude as a study partner, not to get answers handed to me, but to work through concepts conversationally, then consolidate everything into a structured revision document focused on the comparison layer.
The exam doesn't reward definitions. It rewards scenario reading, understanding which option is correct given specific constraints buried in a paragraph. Many questions include subtle details that change the correct answer.
Instead of creating notes like "Vector Search exists," I focused on comparison-based revision tables such as:

Structure-aware vs semantic vs fixed-size chunking: when each is correct and why
Standard vs Storage-Optimized Vector Search endpoints : the multi-constraint decision
Continuous vs triggered sync: matched to data update cadence
Delta Sync vs Direct CRUD: when lineage matters vs when it doesn't
Pay-per-token vs Provisioned throughput - what you use according to your consumption to lower cost.
Batch (ai_query) vs real-time Model Serving: based on latency and use case
Reference-free vs reference-based MLflow judges: know which requires ground truth

By organizing concepts as decisions rather than definitions, I found it much easier to recognize the correct answer when presented with real-world scenarios on the exam.

Exam Day Tips

You have enough time. 56 questions in 90 minutes. I finished in 77 minutes with time to review. Don't rush. Use the mark-for-review feature and do a second pass on anything uncertain.
Read the full scenario before the options. The constraints buried in the middle of the paragraph often determine the correct answer. Options A and B may look equally plausible until you notice a latency or cost constraint you initially skipped.
Diagnose before you answer. For questions describing a problem , wrong tool call order, slow latency, poor retrieval; train yourself to identify which component in the pipeline is actually failing before reading the options.
Code questions are read, not write. You might never be asked to write code from scratch. You will be asked to read a snippet and identify what is wrong, what it does, or why it behaves unexpectedly. The key skill is recognising common anti-patterns.
The exam is more conceptual than Databricks-syntax-heavy. General GenAI knowledge matters: hallucination types, RLHF mechanics, RAG vs fine-tuning tradeoffs. The courses assume this background. Address that gap directly if you're light on it.

Topics That Required Extra Attention: Personal View

Topic 1: Chunking Strategy Selection

All chunking strategies sound similar until you need to choose between them under exam pressure. The clearest framing I found:

Scenario Signal	Use This
Consistent headings or sections in the document	Structure-aware: boundaries already exist, use them
No explicit structure, prose flows naturally	Embedding-based semantic: detects topic shifts via similarity
Context getting cut off at chunk boundaries	Add 10–20% overlap: prevents split-concept retrieval failure
Both specific and broad user questions expected	Parent Document Retrieval: small chunks for precision, parent for context
Approaching the embedding model's token limit	Sub-chunk: databricks-gte-large-en silently truncates at 1024 tokens, no error

The Silent Truncation Trap

Embedding models don't error on oversized input, they silently truncate. Content beyond the token limit is simply never represented in the embedding vector. This is one of the most commonly missed details in exam questions. There's no warning, no exception, no indication anything went wrong.

Topic 2: Vector Search Configuration

The Standard vs Storage-Optimized decision depends on the combination of constraints given in a scenario. Checking only one factor leads to the wrong answer.

Choose This	When
Standard endpoint	Strict latency (<200ms), high QPS (100+), smaller index (<2M vectors)
Storage-Optimized	Large index (10M+ vectors), cost is priority, 500ms+ latency acceptable
Continuous sync	Data changes in real-time or near-real-time (minutes)
Triggered sync	Scheduled updates: match frequency to actual cadence
Direct CRUD API	Real-time vector insertion with no Delta table backing it

Topic 3: Deployment Patterns and Code Anti-Patterns

Specific things kept appearing in practice scenarios:
Delta Sync vs Direct CRUD: Delta Sync is right when your source data lives in Delta and you want full lineage, governance, and rebuild capability. Direct CRUD is right when you need real-time vector insertions without a Delta backing table.
Incremental updates: Only processing changed documents requires enabling delta.enableChangeDataFeed on your Delta table and using MERGE INTO rather than truncate-and-reload. Without this, a nightly pipeline re-processes 100,000 unchanged documents when only 200 actually changed.
Critical anti-pattern: Never put expensive initializations (database clients, model connections) inside predict() in a PyFunc model. That runs on every request. They belong in load_context(), which runs once at model load. The symptom: every request is slow, not just the first.

Topic 4: Model Selection Without Hands-On Experience

If you haven't worked across different model families, the exam tests tradeoffs you may never have consciously thought about. The ones that came up:

Latency vs quality: A 7B model at 150ms may be the only viable choice over a higher-accuracy 34B model at 1,800ms when the SLA is 200ms. Better benchmark score is irrelevant if the model can't meet the constraint.
Multilingual requirements: English-only embedding models (databricks-gte-large-en, bge-large-en) produce poor embeddings for non-English content regardless of quality. Multilingual scenario = multilingual model.
Tool-calling capability: Not all LLMs support function/tool calling. If a model never calls tools during testing, this is the most likely explanation.
Task-specific fit: A narrow fixed-category classification task at high volume (40,000 daily requests) is better served by a small fine-tuned classifier than a large general-purpose LLM; on both latency and cost per inference.
Evaluation metrics by task: HumanEval for code generation, BLEU/ROUGE for translation, domain-specific benchmarks for everything else. Highest overall score ≠ best fit for your task.

Topic 5: AI Gateway : Three Features, Three Jobs

Likely to appear on the exam, and the three features are easy to conflate. Know exactly which one solves which problem:

Feature	Solves
Inference Tables	Full audit trail: complete request/response payload per interaction, queryable by timestamp
Usage Tables	Cost attribution: aggregated token consumption by team/endpoint for chargeback
Rate Limiting	Enforcement: cap requests per user or service principal regardless of which app is calling

Topic 6: Evaluation Judges: Ground Truth Requirements

This distinction comes up directly in exam questions. Know it cold before exam day:

Judge	Needs Ground Truth?	Notes
Correctness	✓ YES	needs expectations field
RetrievalSufficiency	✓ YES	needs expectations field
RelevanceToQuery	✗ NO	reference-free
RetrievalGroundedness	✗ NO	reference-free
RetrievalRelevance	✗ NO	reference-free
Safety	✗ NO	reference-free

Topic 7: The Monitoring Pipeline: Understand Why, Not Just What

The sequence is:

Inference Table → Structured Streaming (unpack raw JSON) → processed Delta table (CDF enabled) → Lakehouse Monitor (Time Series profile) → profile and drift metrics tables

Understanding why each step exists matters more than memorising the sequence. You can't run meaningful monitoring directly on the raw inference table because request/response payloads are stored as opaque JSON strings; monitoring them computes statistics on string length, not on actual semantic content. Unpacking first gives you toxicity scores, response length distributions, and anything semantically meaningful.

Topic 8: Agent Bricks: Knowing When NOT to Use Them

Agent Brick Type	Right Scenario
Knowledge Assistant	RAG over documents with citations. No ML expertise needed. Fast time-to-production.
Information Extraction	High-volume unstructured to structured field extraction to a Delta table.
Multi-Agent Supervisor	Routing between structured (Genie/SQL) and unstructured (RAG) sources. Can also run as single agent with just a toolkit.
Custom LLM	Strict tone, format, or compliance requirements baked into the model; not just a system prompt.

Exam trap: If an agent already exists and is working, don't rebuild it with Agent Bricks. Extend it. Agent Bricks is for starting from scratch when the use case fits a known pattern.

Final Thoughts

This certification covers material that maps directly to real production work. The preparation process pushed me to understand not just what each Databricks tool does, but when to choose it over the alternatives; which is the thinking that actually matters when designing real systems.

Go beyond the courses. Build your own comparison-layer reference. Pay close attention to the 'when to use what' questions. That's where this exam lives.
I'm confident I'll be applying these skills right away in the solutions and architectures I design. It's one of those certifications where the knowledge gained has immediate practical value and translates directly into real-world impact.

200,000 strong and just getting started. My Data and AI Summit 2026

Brahmareddy — Fri, 19 Jun 2026 19:10:36 GMT

Just back from Data and AI Summit 2026 and I am still buzzing.

This was my best summit yet. Moscone was packed. More than 30,000 of us in one place, from over 150 countries, all there for the same reason. To build better with data and AI.

The keynotes set the tone. Ali, Matei, Arsalan and Reynold on the main stage. Satya Nadella and Greg Brockman joining in. The message was clear. The era of agents is here, and it runs on three things. Context, control and choice.

That hit home for me. I have been saying for a while that agents are only as good as the data substrate underneath them. Good data engineering is not a nice to have anymore. It is the precondition for autonomy. This summit made that real.

The product news backed it up. Lakebase is now doing 12 million database launches a day. Agent Bricks crossed 100,000 agents built and is processing more than a quadrillion tokens a year. Genie is moving from a chat box to a real coworker. And Free Edition, which I use every day, now ships Genie Code, serverless GPUs, Lakebase, Agent Bricks and Lakeflow Designer. The full toolkit, no cost. That last one matters a lot to me. Most of my POCs run on Free Edition.

The whole week felt like the platform moving in the same direction I have been writing about. Context first. Apps on top. Governance around it.

What I am most proud of though is the community.

We crossed 200,000 members. Let that sink in. 200,000 practitioners helping each other ship real work. None of this happens by accident. The community team carries it. @MandyR, @Advika, and @Sujitha Thank you. You lead with care and you make this place feel like home for every new person who walks in.

I also met many Databricks leaders this week. The hallway conversations were as valuable as the sessions. Their inputs gave me a lot to think about and a lot to build.

Being a Community Champion is a privilege I do not take lightly. My goal is simple. Help other members do their work better and grow their careers. through my community posts, through POCs, through Databricks developer community. We grow when we lift each other.

Databricks is not just keeping pace in data and AI. It is setting the direction. And we get to build on top of it together.

Already counting down to the next one. Let us make this community even more impactful.

Who else was there? Tell me your top moment.

How a Partitioning Mistake Turned a 12-Minute Databricks Job Into a 2-Hour Nightmare

Avinash_Narala — Fri, 19 Jun 2026 16:13:36 GMT

Hello Databricks Community!

I recently published a detailed breakdown on Medium about a real-world optimization nightmare we faced, and I wanted to share the core lessons learned with this group.

We had a highly efficient Delta table pipeline handling 1.2 billion records that completed its hourly incremental updates in just 12 minutes. In a bid to speed up specific queries, we made a seemingly logical choice: partitioning the table by a high-cardinality column (TransactionID).

Instead of speeding things up, this single layout choice turned our 12-minute job into a 2-hour nightmare.

The root cause? A catastrophic small file explosion (creating 2.7 million partitions and 3.2 million tiny files) that completely drowned Spark in metadata overhead. Upgrading cluster sizes, running standard OPTIMIZE, and trying ZORDER barely made a dent because Spark was spending all its time just navigating physical directories.

We ultimately solved this by migrating completely to Delta Lake's Liquid Clustering, which slashed our file count down to 18,000, removed directory overhead entirely, and dropped our total pipeline runtime down to just 8 minutes.

I've shared the full, step-by-step optimization journey, including our exact benchmarking numbers for each failed attempt, over on Medium.

👉 Read the Full Story on Medium

The Results

Metric	Traditional Partitioning	Liquid Clustering
Total Files	3.2 Million	18,000
Partition Directories	2.7 Million	0
Pipeline Runtime	~120 minutes	8 minutes

Key Takeaway

The old rule of "partition by the column you filter on" fails spectacularly on high-cardinality keys like IDs. If you are facing massive metadata overhead or slow merges, skip the cluster upgrades and switch to Liquid Clustering.

Have you run into similar small-file bottlenecks in your production environment? Let's discuss below!

Foresight — The Third Temporal Dimension of Delta Lake

harsh0610 — Fri, 19 Jun 2026 15:52:49 GMT

Delta Lake gives you time travel backward:

SELECT * FROM sales TIMESTAMP AS OF '2026-01-01'

But what about forward? What will your Delta table look like next month?

Nobody has built probabilistic future queries as a first-class Delta concept — until now.

Introducing Delta Foresight

GitHub: https://github.com/HarshalSant/delta-foresight Install: pip install delta-foresight

from delta_foresight import DeltaForesight

df = DeltaForesight(table="catalog.schema.daily_sales", time_column="sale_date", spark=spark) df.fit() forecast = df.predict(as_of="2026-09-01", confidence=0.90) df.materialize("catalog.delta_foresight.daily_sales_forecast")

Then query it with SQL: SELECT ds, revenue_forecast, revenue_lower_90, revenue_upper_90 FROM catalog.delta_foresight.daily_sales_forecast

What Makes It Different

The forecast IS a Delta table — not a report, not an export. Governed by Unity Catalog, queryable with SQL, shareable via Delta Sharing.
Mathematically valid prediction intervals — uses conformal prediction with proven coverage guarantees. Ask for 90%, get 90%.
Learns your table's temporal DNA — auto-detects frequency, trend, and seasonality from your own Delta history. No manual setup.
MLflow tracking built in — every forecast run logged automatically.
Works inside and outside Databricks — PySpark on Databricks, delta-rs locally, Parquet fallback for dev.

Use Cases

Revenue planning: predict month-end close
ML model health: forecast when AUC will breach threshold
Inventory: predict stock depletion date
Cost management: forecast DBU burn rate
Data quality: forecast null rate trajectory

CLI

foresight predict --table catalog.schema.daily_sales --as-of 2026-09-01 foresight fingerprint --table catalog.schema.daily_sales foresight serve # REST API at localhost:8080/docs

Feedback welcome — which use case matters most to your team? GitHub Issues: https://github.com/HarshalSant/delta-foresight/issues

Also author of vigil-ml:

https://www.linkedin.com/in/harshalsant0

https://github.com/HarshalSant/

DAIS 2026: The Databricks Announcements I Think Clients Should Pay Attention To

mou — Fri, 19 Jun 2026 15:49:14 GMT

The most important thing I took away from Data + AI Summit 2026 was not one product announcement.

It was the direction.

Databricks is building around a very real enterprise problem: companies want AI to help with decisions, operations, customer engagement, software development, security, and analytics, but the AI has to work inside the reality of the business.

That reality includes messy data, strict permissions, different definitions of the same metric, pipelines that break, models that drift, sensitive customer data, many clouds, many tools, and teams that already have enough platforms to manage.

This is why I found this year’s announcements interesting. They were not only about adding more capability. They were about reducing the distance between data, context, AI, governance, and action.

Context is becoming the real AI foundation

The announcement around Genie One, Genie Agents, and Genie Ontology was one of the strongest signals from the summit.

The reason is simple. A business user does not need another generic chatbot. They need an AI experience that understands how their company works.

In most organizations, the business meaning of data is spread across dashboards, SQL queries, notebooks, pipelines, documents, wikis, tickets, and team knowledge. A table may be accurate, but the real definition of the metric may live somewhere else. A dashboard may be popular, but not always certified. A calculation may be used in production, but not documented clearly.

This is the gap Genie Ontology is trying to close.

The interesting part is not only that Genie can answer questions. The interesting part is that Genie can use business context, source authority, freshness, usage, relationships, permissions, and trusted definitions to decide how to answer. That is the difference between an AI answer that sounds right and an AI answer that the business can trust.

Genie One then puts that experience where people work: data, apps, Slack, Teams, mobile, MCP-based experiences, and agent workflows. Genie Agents extend it further by letting teams create domain-specific agents grounded in the same trusted context.

For clients, this is a major point. AI accuracy will not come only from better models. It will come from giving the model the right business context, close to the governed data.

Agent engineering is becoming a platform problem

Agent Bricks and Omnigent were also important to me because they address what many teams are starting to learn.

Building an agent demo is easy. Running agents safely at enterprise scale is not.

Databricks made a very useful point in the Agent Bricks announcement: the core agent loop is only a small part of the work. The hard parts are token capacity, deployment, security, evaluation, monitoring, context, sharing, memory, cost control, and safe execution.

That matches what I see with clients. The excitement around agents is real, but the operating model is still immature. Teams are using different coding agents, different models, different harnesses, different prompts, and different security patterns. That works for experimentation. It does not scale cleanly.

This is where Agent Bricks becomes relevant. It is moving from agent building into a broader agent platform, with model choice, secure sandboxes, memory, skills, MCP support, evaluation, governance, and token controls.

Omnigent is also a smart move. Enterprises are not going to use only one coding assistant or one framework. They will use Claude Code, Codex, custom agents, internal tools, and new tools that are not even popular yet. A meta-harness gives teams a way to compose, control, and share agent workflows without locking everything to one tool.

The managed Omnigent direction on Databricks is especially practical: shared history, remote access, collaboration, isolated execution, and governance through Unity AI Gateway.

My view is that agent development is about to look more like software engineering and platform engineering. The teams that treat agents only as prompts will struggle. The teams that treat agents as governed systems will move faster and with less risk.

ZeroOps is one of the most practical announcements

I liked Genie ZeroOps because it is close to the daily pain of data and ML teams.

Anyone who has worked on a production data platform knows this pattern. A pipeline fails. A schema changes. A table looks fine but the data quality has silently changed. A dashboard number moves and nobody immediately knows whether it is a real business change or a data issue. A model starts producing weaker predictions without throwing an error.

A general coding agent can help write code, but data and AI operations need more than code. They need lineage, logs, telemetry, platform events, data quality signals, job history, permissions, and safe validation against real data.

That is why the ZeroOps flow is useful: detect, assess, remediate, and verify.

The verify step is the part I care about most. Proposed fixes can be tested in a secure sandbox using zero-copy clones, scoped permissions, and isolation before anything touches production. That is a practical enterprise pattern. It keeps people in control while cutting down the time spent on investigation and root-cause analysis.

For ML, this becomes even more important. A model can be technically “up” and still be wrong. Genie ZeroOps for ML can help investigate drift, serving errors, pipeline problems, and production performance issues. As more teams use AI to build more models and pipelines, this operational layer becomes necessary.

Real-time is moving closer to the lakehouse

Lakehouse//RT, Lakebase, Lakeflow, and LTAP all connect to a long-running architecture issue.

Many companies still use separate systems for transactions, analytics, streaming, serving, applications, and AI. This creates copies of data, sync jobs, governance gaps, and additional places where things can fail.

Lakehouse//RT is Databricks’ answer for real-time operational analytics, BI, app serving, and observability workloads directly on the lakehouse. The message I liked from the Lakehouse//RT announcement is that separate serving layers have a real cost: duplication, governance drift, and engineering overhead.

Lakehouse//RT, powered by Reyden, is aimed at millisecond performance without moving data away from the lakehouse. The benchmark numbers are impressive, but the architecture point is more important to me. If teams can serve real-time apps, dashboards, and agent workflows from the same governed data foundation, they reduce a lot of unnecessary complexity.

LTAP goes in the same direction. Lakebase supports transactional workloads. Lakeflow supports ingestion, transformation, orchestration, and pipeline development. Together, they bring transactional and analytical processing closer to the governed lakehouse.

This is very relevant for AI. Agents need current data. Customer experiences need current data. Fraud, supply chain, finance, security, and operations use cases need current data. If data is delayed or copied across too many systems, AI becomes less useful and harder to trust.

Governance has moved into the AI runtime

The Unity Catalog and Unity AI Gateway announcements may be less flashy than agents, but they are extremely important.

Governance is changing. It is no longer only about who can query a table or access a dashboard. Agents can call tools, invoke MCP servers, write code, generate artifacts, trigger workflows, and act across systems. That means governance has to follow the AI interaction itself.

Unity AI Gateway is important because it extends governance into models, agents, MCP services, skills, tools, cost controls, routing, monitoring, and runtime policy enforcement.

The partner ecosystem around Unity AI Gateway also matters. Databricks is integrating with AI security, identity, observability, DLP, runtime guardrail, and agent governance providers. That is important because large companies already have security and identity tools. AI governance cannot live in a separate island.

I also paid attention to the security and compliance announcements: Automatic Identity Management for Entra ID, Okta support in preview, Context-Based Ingress, Private Network Gateway, Lakebase private connectivity, HITRUST across clouds, expanded GovCloud support, and FedRAMP High support coming on Azure Commercial.

This is the work that makes AI usable in regulated environments. It may not get the loudest applause, but clients will care about it when they move from pilots to production.

Apps, Marketplace, and OpenSharing show a broader ecosystem play

The Apps, Marketplace, and OpenSharing announcements were also meaningful.

Databricks Apps is becoming more important because many useful enterprise solutions are small and very specific: an operations portal, a workflow manager, a data quality review app, an internal AI assistant, a model interface, or a business process app. These apps often get delayed because of infrastructure, cost, security review, or unclear ownership.

App Spaces gives admins a way to define access, resources, API scopes, and security policies for groups of apps. Genie App Builder helps teams build apps with awareness of Databricks data, Unity Catalog semantics, and workspace context. Serverless Micro Apps make the economics better for apps that are useful but not always running.

This is a good pattern: let the people closest to the business problem build, but do it inside a governed boundary.

Marketplace and OpenSharing extend this to partners and data providers.

The Marketplace commit drawdown and upcoming transactability are important for commercial adoption. Partners can reach Databricks customers more directly and shorten sales cycles by using pre-committed spend. Apps and Genie Agents can also be distributed through Marketplace, which opens new packaging models.

OpenSharing is the larger architecture move. Delta Sharing was about open, zero-copy data sharing. OpenSharing extends that idea to the agentic era: structured data, unstructured data, models, skills, semantics, and Genie Agents across clouds, platforms, and organizations.

The ability to share a Genie Agent is very interesting. A provider can share an AI experience over their data without forcing the customer to learn the schema, build a UI, or access every underlying table. That can change how proprietary data providers package their value.

This is where “pay per question” becomes more than a marketing idea. A data provider could let customers ask governed natural-language questions against proprietary data, with limits on prompts, rows, and access. That is a very different commercial model from traditional data licensing.

CustomerLake is a good example of AI moving into business workflows

CustomerLake caught my attention because it shows how Databricks is moving closer to business functions, not only technical teams.

Customer data is one of the hardest areas in any company. It is sensitive, duplicated, fragmented, and constantly changing. Traditional CDPs helped marketers activate customer data, but they often created another platform outside the governed data foundation.

CustomerLake takes a different approach by embedding the CDP into Databricks.

The idea of Golden Context is important. A customer profile is useful, but it is not enough. The AI also needs business goals, live signals, channel context, past decisions, and what has already been tried with that customer.

The idea of Infinity Campaigns is also interesting. Instead of static campaigns and large segments, the direction is always-on, real-time, 1:1 engagement where agents help adapt timing, message, and channel based on current context.

This will be a major discussion for marketing, customer experience, and data teams. The CDP conversation is moving closer to the data foundation, governance model, and AI architecture.

ML is becoming more native to the platform

The AI Platform announcements also had a strong practical angle.

Genie Code for ML is useful because ML work is not only writing Python. It includes feature engineering, experiment tracking, evaluation, model registration, deployment, serving, monitoring, drift analysis, and retraining. A generic coding agent will not understand the full ML lifecycle unless it is connected to the platform context.

Genie Code integrates with Unity Catalog, Feature Store, MLflow, AI Runtime, Model Serving, Inference Tables, and production observability. That context is what makes it more useful for ML teams.

AI Runtime is another important step. Serverless A10 and H100 GPUs, multinode training, Lakeflow Jobs support, MLflow observability, and Unity Catalog governance help teams train and fine-tune models without spending so much time on GPU infrastructure.

The real-time ML announcements also matter: streaming features, declarative feature engineering, online feature serving on Lakebase, and high-QPS Model Serving. These are the capabilities that support fraud detection, recommendations, personalization, search, and other low-latency production use cases.

The platform direction is clear: ML should not feel like a separate stack attached to the lakehouse. It should be part of the same governed system.

My takeaway

DAIS 2026 showed Databricks moving toward a more complete operating foundation for enterprise AI.

The common thread I saw across the announcements was this:

Bring context closer to data.
Bring governance closer to AI behavior.
Bring real-time and transactional workloads closer to the lakehouse.
Bring agents closer to safe engineering practices.
Bring apps and business workflows closer to the governed data foundation.
Bring operations closer to automation, but keep humans in control where it matters.

That is a very practical direction.

For clients, the next phase of AI will not be won by creating more disconnected AI agents. It will be won by building a foundation where AI can understand the business, access trusted data, follow governance, work with current context, and take action safely.

That is why I think the DAIS 2026 announcements are worth paying attention to.

They show Databricks moving closer to how real companies actually need AI to work.

Evaluating GenAI Applications the Right Way in Databricks Ecosystem

vinaychavan — Tue, 09 Jun 2026 05:12:05 GMT

Hello Everyone!

I've been spending a lot of time lately thinking about something that keeps coming up in almost every GenAI project I touch — how do you actually know if your model is working well? Not just in demos, but in production, day after day.

So I sat down and jotted down some of my learnings around effective model evaluation techniques for GenAI applications using the Databricks ecosystem. What does good evaluation actually look like? Why do your old ML metrics (Precision, Recall, MAE, MAPE) still matter more than you think? And how do you build a continuous eval loop that catches problems before your users do?

https://medium.com/@vinu2433/evaluating-genai-applications-the-right-way-4def3276018e?postPublishedType=initial

This blog walks through the full evaluation stack — from classification and regression metrics on your retrieval and extraction layers, all the way to LLM-as-a-judge and RAG-specific metrics like faithfulness and context recall — with real Databricks code and MLflow integration throughout.

In upcoming posts, we'll go deeper into prompt engineering strategies, production monitoring patterns, and building eval pipelines at scale on Databricks. Stay tuned, and I'd love to hear how your teams are approaching evals in the comments below!

What Happens When a Data Platform Starts Making Its Own Decisions?

balajiselvarasu — Tue, 09 Jun 2026 08:25:42 GMT

Data platforms are getting smarter — but are we asking the right questions about what that means for data engineering?

I wrote about how Databricks Predictive Optimization is shifting the role of data engineers from reactive maintenance to autonomous operations. The article covers:

🔹 Why optimization becomes a visibility problem at enterprise scale 🔹 How Predictive Optimization actually works under the hood 🔹 Z-Ordering vs Liquid Clustering — and how Predictive Optimization handles both differently 🔹 Where automation wins, and where engineering judgment still matters

The question I keep coming back to: how much of data engineering optimization will remain manual five years from now?

Would love to hear how others in the community are approaching this — especially those who have already enabled Predictive Optimization at scale.

👉 [https://medium.com/p/37a11392d2af?postPublishedType=initial]

Z-Ordering VS Liquid Clustering

ManojSampath — Tue, 09 Jun 2026 07:48:40 GMT

Hi everyone,

I recently published a technical blog on Z-Ordering vs Liquid Clustering in Delta Lake, covering the internals of both techniques in detail.

Rather than focusing only on the syntax, the blog goes deeper into:

The origin of Z-Ordering — tracing back to Morton's space-filling curve from 1966
How bit interleaving works and how Z-values are computed
Why Z-Ordering degrades over time on live, continuously loaded tables
How Liquid Clustering addresses this using the Hilbert Curve
What CLUSTER BY AUTO does internally — query pattern tracking and frequency scoring
Why partitioning and Liquid Clustering are incompatible per the official documentation
Practical recommendations from real Supply Chain ATP implementations

The blog is built around a Supply Chain Available to Promise scenario to keep the concepts grounded in a real-world context.

Would appreciate any feedback or thoughts from the community.

https://medium.com/@pmanoj0104/z-ordering-vs-liquid-clustering-a79a12ad0038

Lakeflow Connect: Managed Ingestion Without the Pipeline Tax

SHIVAMPORWAL — Tue, 09 Jun 2026 11:05:43 GMT

I recently published a piece on Lakeflow Connect and wanted to share it here since this community is where the conversation actually happens.

The post covers something most of us have lived through, the hidden cost of maintaining ingestion pipelines. The Fivetran subscription, the S3 landing zone, the Airflow DAG, the custom CDC merge logic, the monitoring stack, five vendors and 1,200 lines of code just to get data from Salesforce into Delta.

Lakeflow Connect collapses that into one declarative resource inside Databricks. I broke down:

What changes architecturally when you migrate, including the before/after diff
How log-based CDC and schema evolution are handled natively
Where Lakeflow Connect fits, and where it doesn’t, since streaming with sub-second latency still belongs in Structured Streaming
What this means for data teams thinking about headcount and tool consolidation

Full post on Medium:
https://medium.com/@sporwal8989/lakeflow-connect-managed-ingestion-without-the-pipeline-tax-1d5fd74d516f

A few things I’d love to discuss with this community:

For teams that have already migrated, what was the most painful part of the cutover?
The connector catalog is growing fast but isn’t universal yet. What sources do you wish were supported that aren’t?
How are you handling the gap between Lakeflow Connect’s incremental ingestion and use cases that still need sub-second latency?

Curious to hear what others have seen.

Thanks for reading.

Your data is clean. But who's accessing it, and how? Governing your Lakehouse with Unity Catalog

savlahanish27 — Wed, 17 Jun 2026 10:14:12 GMT

Nobody told the analytics team they couldn't query the raw customer table. So, they did.

Full names, email addresses, phone numbers - exported to a CSV for "a quick look." No alert fired. No one flagged it. We found out three weeks later during a compliance review.

The pipeline was solid. Months of work. Clean transformations, reliable runs, well-structured tables. We just never thought about who could actually reach in and pull the data out.

That's the part that gets skipped. You build a great pipeline and assume the work is done. But in retail - where you're sitting on customer PII, order history, and payment data - the pipeline is only half the job. The other half is knowing exactly who can see what, where data came from, and having an answer ready when compliance asks.

"We trust our team" doesn't hold up in an audit.

Unity Catalog is what fixed this for us. I've been running it in production for about a year, and the biggest change isn't a feature - it's that access control stops being something you bolt on after the pipelines are built and becomes part of how the platform works. This post covers the three things I use most: PII tagging, lineage tracking, and row-level security. All with working SQL you can adapt directly.

A quick picture of what Unity Catalog does

Unity Catalog introduces a three-level namespace on top of your existing Databricks workspace:

catalog └── schema (database) └── table / view / volume

So instead of database.table, you now reference catalog.schema.table - something like retail_prod.sales.orders. This might feel like extra typing at first, but it's what makes centralized governance possible - a single catalog with one permission model covering all your workspaces.

Tagging PII columns - know what you're carrying

The first thing I did when setting up Unity Catalog was tag every column that carries personally identifiable information. Not because anyone asked me to - just because it's harder to lock down data you haven't mapped.

UC has a built-in tag system - catalog, schema, table, or column level. For PII I go straight to column level - it's the most precise, and it gives you a queryable inventory of sensitive fields across your entire platform.

Here's how to create and assign a PII tag:

-- sql -- Create a tag in your catalog CREATE TAG IF NOT EXISTS pii_category ALLOWED_VALUES 'name', 'email', 'phone', 'address', 'payment'; -- Apply tags to sensitive columns ALTER TABLE retail_prod.customers.profiles ALTER COLUMN full_name SET TAGS ('pii_category' = 'name'); ALTER TABLE retail_prod.customers.profiles ALTER COLUMN email_address SET TAGS ('pii_category' = 'email'); ALTER TABLE retail_prod.customers.profiles ALTER COLUMN phone_number SET TAGS ('pii_category' = 'phone'); ALTER TABLE retail_prod.orders.transactions ALTER COLUMN card_last_four SET TAGS ('pii_category' = 'payment');

Once tagged, this query gives you a full map of every sensitive column across your platform:

-- sql SELECT table_catalog, table_schema, table_name, column_name, tag_value AS pii_type FROM system.information_schema.column_tags WHERE tag_name = 'pii_category' ORDER BY table_catalog, table_schema, table_name;

Run that and you have a full inventory of sensitive columns. The first time compliance asked us where customer email appeared across the platform, that query answered it in about ten seconds.

Worth being clear on though: tags are metadata only - they don't restrict access by themselves. What they do is give you the foundation to build policies on top of, which is where the next two sections come in.

Lineage - where did this data come from?

Every time a Delta Live Tables pipeline runs - or any notebook, job, or SQL query that reads from and writes to UC-managed tables - Unity Catalog automatically captures the data flow. No configuration required. You get column-level lineage out of the box.

The real pain shows up when an analyst reports the lifetime_value column looks off. Without lineage you're manually tracing through notebooks and pipeline code trying to figure out what fed it. With lineage, you open Catalog Explorer, click the column, and see the exact chain - Silver table, Bronze table, raw source file. Done in seconds instead of an hour.

The same lineage is queryable directly if you need it in a script or dashboard:

-- sql -- Find all upstream tables feeding into the Gold customer table SELECT source_table_full_name, target_table_full_name, created_at FROM system.access.table_lineage WHERE target_table_full_name = 'retail_prod.gold.customer_lifetime_value' ORDER BY created_at DESC;

Lineage also works in reverse. If you're about to deprecate a Bronze table, you can check what downstream assets depend on it before you touch anything:

-- sql -- What tables depend on our Bronze orders table? SELECT source_table_full_name, target_table_full_name FROM system.access.table_lineage WHERE source_table_full_name = 'retail_prod.bronze.raw_orders' ORDER BY target_table_full_name;

No more "I deleted a table and three pipelines broke" incidents.

Fine-grained access control - who sees what

Permissions in UC are just SQL. You grant privileges at any level of the hierarchy - catalog, schema, or table - and they inherit downward:

-- sql -- Analytics team gets read access to Gold only GRANT USE CATALOG ON CATALOG retail_prod TO `analytics-team`; GRANT USE SCHEMA ON SCHEMA retail_prod.gold TO `analytics-team`; GRANT SELECT ON SCHEMA retail_prod.gold TO `analytics-team`; -- Data engineers get full access to Bronze and Silver GRANT ALL PRIVILEGES ON SCHEMA retail_prod.bronze TO `data-engineering`; GRANT ALL PRIVILEGES ON SCHEMA retail_prod.silver TO `data-engineering`; -- Block direct access to raw PII table REVOKE SELECT ON TABLE retail_prod.customers.profiles FROM `analytics-team`;

The analytics team can query all Gold tables. They cannot touch raw Bronze data or the customers PII table directly. Clean separation, enforced at the platform level.

Row-level security with dynamic views

Table-level permissions are a blunt instrument. Sometimes you need more precision - regional managers should only see orders from their own region, or a customer support team should only see records for accounts they're assigned to.

The trick is is_account_group_member() - it checks the caller's group at query time, so the same view returns different rows for different people:

-- sql CREATE OR REPLACE VIEW retail_prod.sales.regional_orders AS SELECT order_id, order_date, customer_id, product_sku, order_amount, region FROM retail_prod.silver.orders WHERE CASE WHEN is_account_group_member('region-apac') THEN region = 'APAC' WHEN is_account_group_member('region-emea') THEN region = 'EMEA' WHEN is_account_group_member('region-us') THEN region = 'US' WHEN is_account_group_member('data-engineering') THEN TRUE ELSE FALSE END; -- Grant access to the view, not the underlying table GRANT SELECT ON VIEW retail_prod.sales.regional_orders TO `analytics-team`;

Same query, same view - different results based on who's asking. The underlying Silver table stays locked down.

Column masking for PII

One more pattern I use heavily: column masking. Instead of hiding an entire column, you partially mask it depending on who's querying. We use this for customer support - they need to verify who someone is, but there's no reason they should be able to export a raw email list.

-- sql -- Create a masking policy CREATE MASKING POLICY retail_prod.security.email_mask AS (email STRING) RETURNS STRING -> CASE WHEN is_account_group_member('data-engineering') THEN email WHEN is_account_group_member('customer-support') THEN CONCAT(LEFT(email, 2), '****', SUBSTRING_INDEX(email, '@', -1)) ELSE '****@****.***' END; -- Apply the masking policy to the column ALTER TABLE retail_prod.customers.profiles ALTER COLUMN email_address SET MASKING POLICY retail_prod.security.email_mask;

A data engineer sees the full email. A customer support agent sees jo****@gmail.com. Anyone else sees ****@****.***. One table, one policy, three different experiences - and zero copies of the data floating around.

Auditing - who actually accessed what

There's one more thing most people skip: checking whether any of this is actually being used. Unity Catalog logs every access event to the system audit tables:

-- sql SELECT user_identity.email AS user_email, action_name, request_params.table_full_name AS table_accessed, event_time FROM system.access.audit WHERE request_params.table_full_name = 'retail_prod.customers.profiles' AND event_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS AND action_name IN ('SELECT', 'READ') ORDER BY event_time DESC;

Run this weekly, pipe it into a Databricks SQL dashboard, and you have an access audit trail that satisfies most compliance requirements without any custom logging infrastructure.

The mistakes I'd save you from

Don't skip the catalog hierarchy design. Once you have tables in Unity Catalog, restructuring the catalog/schema layout is painful. Spend an hour upfront deciding how you'll organize catalogs - by environment, by domain - before you start creating tables.

Tags alone don't protect data. I've seen teams tag all their PII columns and then consider the job done. Tags are discovery and documentation - they don't enforce anything. Pair them with masking policies and view-based access control.

is_account_group_member checks group membership at query time. This is a feature, not a bug - add a user to a group and their access updates immediately without changing any views or policies. But it also means if you remove someone from a group, they lose access instantly. Make sure your group membership is managed carefully. I'd rather have a slightly annoying offboarding checklist than discover an ex-employee still has access to the customer table three months later.

Test your dynamic views as a non-admin. It's easy to build a row-level security view, test it as an admin - who often bypasses restrictions - and ship it thinking it works. Always verify by impersonating a user in the target group.

What this actually changes

The part that surprised me most wasn't the technical setup - it was how much easier governance conversations became once everything lived in the platform rather than in a spreadsheet. People can't work around it accidentally. Access is granted explicitly, audited automatically, and revoked cleanly.

In retail, where you're handling customer PII and payment data, getting this wrong isn't just a technical problem. It shows up in audits, in compliance reviews, in the conversation nobody wants to have at 9am on a Monday after a data breach.

If I had to do this again from scratch, I'd do it in this order:

Tag PII columns first - you need to know what you're protecting before you can protect it
Get lineage working early - it's much more useful for preventing incidents than for investigating them after the fact
Build access control into views and masking policies, not just table-level grants
Check the audit logs regularly, even when nothing seems wrong - that's usually when you find something useful

If you're setting up Unity Catalog for the first time or migrating from the legacy Hive metastore, start small. Pick one domain, one catalog, get the permissions right, then expand. Trying to govern everything at once leads to over-complicated structures that nobody maintains.

Drop a comment if you've hit any Unity Catalog edge cases in production - particularly around external tables or cross-workspace sharing. Always curious what others have run into.

Building Production-Ready SDP Pipelines with Genie Code: The Complete Guide

shwetav1407 — Tue, 16 Jun 2026 18:22:36 GMT

How Databricks’ AI agent transforms data engineering from manual craftsmanship into conversational pipeline development

Data engineers have long accepted a painful truth: building production-grade ETL pipelines means wrestling with hundreds of lines of orchestration code, manually encoding execution order, handling incremental processing logic, and then praying nothing breaks at 2 AM. Spark Declarative Pipelines (SDP) already simplified this dramatically by letting you declare what your data should look like rather than how to get there. Now, with Genie Code in Agent mode, you don’t even have to write those declarations yourself.

Press enter or click to view image in full size

In this guide, we’ll walk through building a complete medallion architecture pipeline using Genie Code and SDP — from raw ingestion through business-ready analytics — and explore the patterns that make this approach production-worthy.

What Is SDP, and Why Should You Care?

Lakeflow Spark Declarative Pipelines (SDP) is Databricks’ framework for building batch and streaming data pipelines in SQL and Python. Unlike traditional Spark jobs where you manually define execution order, manage checkpoints, and handle retries, SDP lets you declare your transformations and handles the orchestration automatically.

The key benefits that matter for real-world pipelines:

Automatic orchestration — SDP analyzes dependencies across all your source files, builds a dataflow graph, and determines the optimal execution order with maximum parallelism. It also retries failures at the most granular level possible: first the Spark task, then the flow, then the pipeline.
Incremental processing built in — Materialized views automatically process only new data and changes. No more writing MERGE statements by hand.
Data quality as code — Expectations let you define quality constraints inline, right next to your transformations.
Unified batch and streaming — Toggle between batch and streaming processing modes with a single keyword change.

Here’s what that looks like compared to traditional approaches:

The Old Way (PySpark + Manual Orchestration)

# Hundreds of lines for a simple weekly sales pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Step 1: Read raw data (manually handle incremental)
raw_df = spark.read.format("delta").load("/data/raw_sales")
last_processed = spark.read.format("delta") \
    .load("/checkpoints/last_ts").collect()[0][0]
new_data = raw_df.filter(col("event_time") > last_processed)

# Step 2: Clean (manually write quality checks)
cleaned = new_data.filter(
    col("amount").isNotNull() & 
    (col("amount") > 0)
)

# Step 3: Aggregate (manually handle upserts)
weekly = cleaned.groupBy(
    window("event_time", "1 week"), "region"
).agg(sum("amount").alias("total_sales"))

# Step 4: Write (manually handle merge)
target = DeltaTable.forPath(spark, "/data/weekly_sales")
target.alias("t").merge(
    weekly.alias("s"),
    "t.window = s.window AND t.region = s.region"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

# Step 5: Update checkpoint (manually track state)
# ... plus an Airflow DAG for scheduling, retries, alerting

The SDP Way (SQL)

-- The entire pipeline in a few declarations

-- Bronze: raw ingestion with Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_sales
AS SELECT * FROM STREAM read_files(
  '/data/landing/sales/',
  format => 'json',
  schema => 'event_time TIMESTAMP, region STRING, 
             product STRING, amount DOUBLE'
);

-- Silver: cleansed with quality expectations
CREATE OR REFRESH STREAMING TABLE silver_sales (
  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
  CONSTRAINT not_null_region EXPECT (region IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT
  event_time,
  region,
  product,
  amount,
  current_timestamp() AS processed_at
FROM STREAM(bronze_sales);

-- Gold: business-ready weekly aggregation
CREATE OR REFRESH MATERIALIZED VIEW gold_weekly_sales
AS SELECT
  date_trunc('week', event_time) AS week_start,
  region,
  COUNT(*) AS transaction_count,
  SUM(amount) AS total_sales,
  AVG(amount) AS avg_transaction
FROM silver_sales
GROUP BY date_trunc('week', event_time), region;

That’s it. SDP handles incremental processing, execution order, retries, and checkpoint management. The bronze and silver tables use streaming semantics (the STREAM keyword), while the gold materialized view uses batch semantics but still only reprocesses changed data.

Enter Genie Code: Your AI Data Engineering Partner

Now here’s where it gets interesting. Genie Code in Agent mode — available inside the Lakeflow Pipelines Editor — doesn’t just help you write SDP code. It can autonomously plan, generate, run, validate, and fix entire pipelines from a single natural language prompt.

How Genie Code Agent Mode Works

When you enable Agent mode in the Genie Code panel within the Lakeflow Pipelines Editor, the agent adapts its capabilities specifically for data engineering tasks. Unlike chat mode, Agent mode can:

Plan a multi-step solution and present it for your review
Search your Unity Catalog for relevant tables, schemas, and lineage
Generate SQL or Python SDP source files in the pipeline editor
Run pipeline updates and read the output datasets
Diagnose and fix errors automatically, iterating until the pipeline succeeds
Respect your Unity Catalog permissions — it can only access data you can access

The key design principle is human-in-the-loop: Genie Code proposes plans and asks for approval before executing. You can Allow, Decline, or ask it to try a different approach.

Tutorial: Building a Medallion Pipeline with Genie Code

Let’s walk through building a real pipeline — an e-commerce analytics pipeline that ingests order data, cleans and enriches it, and produces dashboards-ready metrics.

Prerequisites

A Databricks workspace with Partner-powered AI features enabled
Access to the Lakeflow Pipelines Editor
Unity Catalog configured with a target catalog and schema

Step 1: Create Your Pipeline and Open Genie Code

Navigate to Pipelines in the sidebar and create a new pipeline. Give it a name like ecommerce_analytics and set your target catalog and schema (e.g., analytics.ecommerce).

Once in the Lakeflow Pipelines Editor, open the Genie Code panel and switch to Agent mode.

Step 2: Prompt Genie Code to Build the Pipeline

Start with a descriptive prompt that tells Genie Code what you want:

Your prompt: “Build a medallion architecture pipeline for e-commerce analytics. I have raw order data landing as JSON files in /Volumes/raw_data/orders/ with fields: order_id, customer_id, product_id, quantity, unit_price, order_timestamp, and shipping_region. Create bronze ingestion with Auto Loader, silver cleansing with quality expectations, and gold aggregations for daily revenue by region and top products.”

Genie Code will create a step-by-step plan that looks something like:

Plan:
1. Search Unity Catalog for existing related tables
2. Create bronze_orders.sql — streaming table with Auto Loader
3. Create silver_orders.sql — cleaned data with expectations
4. Create gold_daily_revenue.sql — daily revenue by region
5. Create gold_top_products.sql — top products materialized view
6. Run the pipeline and validate outputs

Review the plan, ask clarifying questions if needed, then select Allow to let Genie Code proceed.

Step 3: Watch Genie Code Generate Your Pipeline

Genie Code creates each source file in your pipeline. Here’s what the generated code typically looks like:

File: bronze_orders.sql

-- Bronze layer: raw ingestion from JSON landing zone

CREATE OR REFRESH STREAMING TABLE bronze_orders
COMMENT 'Raw e-commerce orders ingested via Auto Loader'
AS SELECT
  *,
  _metadata.file_name AS source_file,
  _metadata.file_modification_time AS file_mod_time,
  current_timestamp() AS ingestion_timestamp
FROM STREAM read_files(
  '/Volumes/raw_data/orders/',
  format => 'json',
  inferColumnTypes => 'true'
);

File: silver_orders.sql

-- Silver layer: cleansed and validated orders

CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_order_id 
    EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_quantity 
    EXPECT (quantity > 0 AND quantity < 10000) ON VIOLATION DROP ROW,
  CONSTRAINT valid_price 
    EXPECT (unit_price > 0) ON VIOLATION DROP ROW,
  CONSTRAINT valid_timestamp 
    EXPECT (order_timestamp IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_region 
    EXPECT (shipping_region IS NOT NULL) ON VIOLATION FAIL UPDATE
)
COMMENT 'Cleansed orders with quality expectations enforced'
AS SELECT
  order_id,
  customer_id,
  product_id,
  CAST(quantity AS INT) AS quantity,
  CAST(unit_price AS DOUBLE) AS unit_price,
  CAST(quantity AS INT) * CAST(unit_price AS DOUBLE) AS line_total,
  CAST(order_timestamp AS TIMESTAMP) AS order_timestamp,
  UPPER(TRIM(shipping_region)) AS shipping_region,
  ingestion_timestamp
FROM STREAM(bronze_orders);

File: gold_daily_revenue.sql

-- Gold layer: daily revenue metrics by region

CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue
COMMENT 'Daily revenue aggregation by shipping region'
AS SELECT
  DATE(order_timestamp) AS order_date,
  shipping_region,
  COUNT(DISTINCT order_id) AS total_orders,
  COUNT(DISTINCT customer_id) AS unique_customers,
  SUM(line_total) AS total_revenue,
  AVG(line_total) AS avg_order_value,
  SUM(quantity) AS total_units_sold
FROM silver_orders
GROUP BY DATE(order_timestamp), shipping_region;

File: gold_top_products.sql

-- Gold layer: top products by revenue

CREATE OR REFRESH MATERIALIZED VIEW gold_top_products
COMMENT 'Product performance ranked by total revenue'
AS SELECT
  product_id,
  COUNT(DISTINCT order_id) AS times_ordered,
  SUM(quantity) AS total_units,
  SUM(line_total) AS total_revenue,
  AVG(unit_price) AS avg_price
FROM silver_orders
GROUP BY product_id;

Step 4: Genie Code Runs and Validates

After generating the files, Genie Code asks for permission to run the pipeline. Once you approve it:

Triggers a pipeline update
Monitors execution across all flows
Reads the output datasets to verify data landed correctly
Reports back with row counts, any expectation violations, and the DAG structure

If something fails — say a schema mismatch in the JSON files — Genie Code diagnoses the error, proposes a fix (like adjusting the schema inference or adding a CAST), and iterates until the pipeline succeeds.

Going Deeper: Python SDP with Genie Code

While SQL is the most common approach, SDP also supports Python for more complex transformation logic. The Python API uses decorators from the pyspark.pipelines module (imported as dp).

Here’s what a Python-based silver layer might look like when you need custom transformation logic:

from pyspark import pipelines as dp
from pyspark.sql.functions import col, upper, trim, when, lit

@dp.table(
    name="silver_orders_enriched",
    comment="Orders enriched with derived customer segments"
)
@dp.expect("valid_order_id", "order_id IS NOT NULL", on_violation="drop")
@dp.expect("valid_amount", "line_total > 0", on_violation="drop")
def silver_orders_enriched():
    return (
        spark.readStream.table("bronze_orders")
        .withColumn("line_total", col("quantity") * col("unit_price"))
        .withColumn("shipping_region", upper(trim(col("shipping_region"))))
        .withColumn(
            "customer_segment",
            when(col("line_total") >= 500, lit("premium"))
            .when(col("line_total") >= 100, lit("standard"))
            .otherwise(lit("basic"))
        )
        .withColumn("order_date", col("order_timestamp").cast("date"))
    )

You can ask Genie Code specifically for Python implementations:

Your prompt: “Add a Python-based silver transformation that enriches orders with a customer loyalty tier based on historical order count from the customers table in analytics.core.”

Genie Code will search your Unity Catalog for the customers table, understand its schema, and generate a Python file that joins and enriches appropriately.

Handling Change Data Capture (CDC)

One of SDP’s most powerful features is AUTO CDC, which handles change data capture with full support for out-of-order events. This is where things get genuinely hard in traditional pipelines — and trivial in SDP.

SQL example for CDC with SCD Type 2:

-- Streaming table to capture raw CDC events

CREATE OR REFRESH STREAMING TABLE customers_cdc_raw
AS SELECT * FROM STREAM read_files(
  '/Volumes/raw_data/customers_cdc/',
  format => 'json'
);

-- Cleansed CDC with expectations

CREATE OR REFRESH STREAMING TABLE customers_cdc_clean (
  CONSTRAINT valid_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT
  customer_id,
  name,
  email,
  address,
  operation,
  operation_timestamp
FROM STREAM(customers_cdc_raw);

-- Apply CDC changes with SCD Type 2 history tracking

CREATE OR REFRESH STREAMING TABLE customers;

AUTO CDC INTO customers
FROM STREAM(customers_cdc_clean)
KEYS (customer_id)
SEQUENCE BY operation_timestamp
STORED AS SCD TYPE 2;

You can prompt Genie Code with something like:

Your prompt: “Add change data capture for customer updates from Debezium CDC events. I need SCD Type 2 to track historical changes to customer addresses.”

Genie Code understands the CDC patterns and generates the appropriate AUTO CDC declarations.

Data Quality Expectations: Your Safety Net

Expectations are SDP’s built-in data quality framework. There are three violation behaviors:


Behavior What Happens Use When
ON VIOLATION DROP ROW Invalid rows are silently dropped Tolerating messy source data
ON VIOLATION FAIL UPDATE Entire pipeline update fails Critical fields that must exist
(no action specified) Invalid rows are logged but kept Monitoring without blocking

Pro tip: Use Genie Code to add expectations iteratively. After an initial pipeline run, ask:

“Analyze the bronze_orders data and suggest quality expectations for the silver layer based on the actual data distribution.”

Genie Code can read the output datasets, profile the data, and propose expectations that make sense for your actual data — not just generic null checks.

Production Patterns and Best Practices

1. Pipeline Configuration with YAML Spec

Your pipeline project uses a YAML spec file for top-level configuration:

# pipeline.yaml
name: ecommerce_analytics
target_catalog: analytics
target_schema: ecommerce
libraries:
  - path: ./bronze_orders.sql
  - path: ./silver_orders.sql
  - path: ./gold_daily_revenue.sql
  - path: ./gold_top_products.sql
configuration:
  spark.sql.shuffle.partitions: "auto"

2. Parameterize with SET

Use SET to inject environment-specific configurations:

SET env = 'production';
SET raw_path = '/Volumes/${env}/raw_data/orders/';

CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM STREAM read_files(
  '${raw_path}',
  format => 'json'
);

3. Mix SQL and Python Files

A single pipeline can contain both SQL and Python source files. Use SQL for straightforward transformations and Python when you need UDFs, ML feature engineering, or complex business logic.

4. Use Genie Code for Ongoing Maintenance

Genie Code doesn’t just build pipelines — it monitors them. It can:

Triage failures when a pipeline run breaks
Investigate anomalies in data quality metrics
Handle DBR upgrades by updating deprecated syntax
Optimize resource allocation based on observed workload patterns

Ask it things like:

“The silver_orders pipeline has been failing since yesterday. Diagnose the issue.”

“Optimize the compute configuration for this pipeline — it’s running slowly on large backfills.”

Wrapping Up

The combination of SDP and Genie Code represents a genuine paradigm shift for data engineering on Databricks. SDP eliminates the boilerplate of pipeline orchestration, and Genie Code eliminates the boilerplate of writing SDP. What used to take days of manual pipeline construction can now happen in a single conversation.

The key takeaways:

Start with SDP — even without Genie Code, the declarative approach saves enormous amounts of manual orchestration code.
Use Genie Code Agent mode in the Lakeflow Pipelines Editor to plan, generate, and validate entire pipelines from natural language.
Build iteratively — start with a basic bronze-silver-gold structure, then ask Genie Code to add CDC handling, expectations, and enrichments.
Trust the loop — Genie Code’s ability to run the pipeline, read outputs, diagnose errors, and fix them autonomously is the real superpower.
Keep humans in control — every destructive action requires your approval. Genie Code proposes; you decide.

SDP and Genie Code are both generally available today at no additional cost for all Databricks customers. Open the Lakeflow Pipelines Editor, flip on Agent mode, and start talking to your data infrastructure.

Ready to get started? Check out the Databricks SDP documentation and the Genie Code guide for pipeline development.

The Comparison: Why the Alternatives Fall Short for Databricks-Native Agentic Systems

Agre_Celebal — Tue, 16 Jun 2026 17:48:47 GMT

The OLTP architecture your agentic systems actually need, and how it compares to Supabase, Azure PostgreSQL, and Cosmos DB

Earlier this year, Nikita Shamgunov — the engineer leading Databricks Lakebase — published a number that reframed my entire architecture review: AI agents now create roughly 4x more databases than human developers.

Not 4x more queries. 4x more databases.

If you're building agentic AI systems on Databricks and still reaching for Supabase, Azure Database for PostgreSQL, or Cosmos DB as your OLTP layer — this article will challenge that decision. Not because those platforms are bad. They're not. But because they were designed for a world where humans write schemas, humans provision databases, and humans decide when something scales. Agents don't work that way. And the architecture that serves human-paced development quietly breaks under agentic workloads.

I learned this the hard way while building an internal Agentic Intelligence Platform at Celebal Technologies — three agent modules (Swarm Coordination, Ontology-Based Reasoning, and Causal Optimization) sharing a unified LLMOps spine on Databricks. I'll show you exactly what I got wrong in the database layer, what Lakebase changes, and how the alternatives stack up for teams building enterprise AI on Databricks.

The Problem: Agents Don't Use Databases the Way Humans Do

Traditional database architecture assumes a human-paced world. Applications write transactions. Dashboards read. ETL pipelines shuttle data between the OLTP and OLAP layers. The entire stack was designed around predictable access patterns and a well-understood divide between operational and analytical data.

Agents shatter all three of those assumptions simultaneously.

They're inherently ephemeral. A swarm agent coordinating a supply chain analysis spins up, decomposes a task across five specialist agents, writes hundreds of state checkpoints, and terminates — all in under thirty seconds. The next invocation may run on a completely different thread with zero shared context from the prior session. Legacy databases aren't built for disposable, bursty compute that needs to scale to zero between workloads and spin back up instantly for the next one.

They generate massive, high-frequency state churn. Every tool call, reasoning step, context retrieval, and handoff between agents is a potential checkpoint. For a multi-turn swarm agent handling a complex analytical task, that's hundreds of writes per session — each requiring exact-ID retrieval by thread_id or session_id, not vector similarity search. Postgres handles this natively. A Delta table, even a well-ZORDER'd one, adds overhead for an access pattern it was never designed to serve.

They need to reach analytical data without crossing a platform boundary. An agent recommending inventory adjustments needs to query the Gold Delta tables — the same tables your ML models trained on, governed by the same Unity Catalog policies your data engineering team enforces. If your OLTP layer lives outside Databricks, you're building a data copy pipeline just so your agent can read data that's already on the platform.

That third problem is where I went wrong.

The Mistake I Made Building the Agentic Platform

When I built the Swarm Coordination module of our Agentic Intelligence Platform, I used a Unity Catalog Delta table as the shared persistent memory store for multi-turn agent sessions. Delta was a reasonable first choice — it gave me time travel for session debugging, UC lineage on every agent write, and the ability to query session history in SparkSQL.

But Delta is an OLAP-optimized storage format. When the coordinator agent needed to retrieve the exact current state for a specific thread_id, it was running a scan-optimized query engine against a point-lookup workload. I added ZORDER on (session_id, turn_number) and tuned file sizes — which helped. But it was always the wrong tool for the access pattern.

What the architecture actually needed was a clean separation of concerns:

Short-term session state (checkpoints, thread context, current turn, handoff records) → a transactional store with exact-ID retrieval and sub-10ms read latency
Long-term episodic memory (past session summaries, cross-session reasoning patterns, performance analytics) → Delta Lake, where batch SparkSQL queries and Lakehouse Monitoring make sense

Lakebase is the transactional half of that equation. And it's the piece I didn't have.

What Lakebase Provides for Agentic Systems

Lakebase is Databricks' fully managed, serverless PostgreSQL database — built on the Neon architecture (which Databricks acquired) and integrated natively into the Databricks platform. It reached General Availability in February 2026. Here are the capabilities that directly change the agent architecture:

Native LangGraph Checkpointing

Lakebase is a supported LangGraph checkpointer backend on both Databricks Apps and Model Serving endpoints. Authentication between your application and Lakebase is resolved automatically through the platform's Service Principal — no credential management in application code, no secret rotation for a separate database connection string.

from langgraph.checkpoint.postgres import PostgresSaver
from databricks.sdk import WorkspaceClient

# Databricks resolves authentication automatically via Service Principal
w = WorkspaceClient()
conn_str = w.lakebase.get_connection_string(instance_name="agent-state-prod")

# LangGraph Postgres checkpointer backed by Lakebase
checkpointer = PostgresSaver.from_conn_string(conn_str)

# The agent now has durable, OLTP-grade session state
agent = create_react_agent(model, tools, checkpointer=checkpointer)

This is the pattern you'd apply to the Swarm Coordination module. The coordinator's session state — which agent it's routing to, which specialist has already responded, the current confidence score — lives in Lakebase. The MLflow Trace of the full execution graph is separate (logged as a Databricks artifact). Two different concerns, two different stores, each doing what it does best.

Instant Database Branching for Agent Experimentation

This is the capability that directly addresses the "4x more databases" pattern. Lakebase supports copy-on-write branching: a full, isolated branch of a production-scale database in under one second, at near-zero initial storage cost (only diffs are written on change).

For agents, this changes what's possible:

A Causal Optimization agent running counterfactual "what-if" scenarios can branch the intervention state, explore the outcome, and discard the branch — without any risk to the production state
An agent autonomously testing schema migrations can branch, run the migration, validate, and either promote or roll back in a single API call
Development environments for agent workflows are ephemeral by default, provisioned and torn down programmatically

Databricks telemetry shows production Lakebase deployments averaging roughly 10 branches per database project, with some agent-driven workflows reaching hundreds of nested iterations. That pattern is structurally impossible with traditional managed Postgres where creating a copy requires duplicating the full storage filesystem.

Autoscaling with Scale-to-Zero

Agent workloads are bursty in a way that application workloads rarely are. Thousands of concurrent sessions during business hours, complete silence at 2am. Lakebase scales its compute up under load and down to zero between workloads — costs align with actual usage, not provisioned capacity. For multi-agent platforms running on Databricks Apps, this means the transactional backend matches the compute model of the application layer itself.

Managed Delta Sync — The ETL Eliminator

Every write to Lakebase is automatically synced to Delta tables in Unity Catalog. For agent systems, this is what closes the long-term memory loop without custom code:

Agent session checkpoints (short-term) → Lakebase → automatic Delta sync → Gold layer for analysis
Lakehouse Monitoring can track agent reasoning drift, latency patterns, and success rate from the Delta-synced inference data
The grid operations team in our Solar Forecasting project needed low-latency reads on Gold forecast data — we built a data copy pipeline as a workaround that added latency and a maintenance surface.

Unity Catalog as the Single Governance Layer

Lakebase instances are registered in Unity Catalog under the same 3-level namespace as your Delta tables and ML models. The same row-level security policies, column masking, lineage graphs, and access audit logs that govern energy_nz.solar.gold also govern the Lakebase instance storing agent session state. For enterprise AI systems operating under regulatory oversight, this is a structural requirement — not a preference.

The Comparison: Why the Alternatives Fall Short for Databricks-Native Agentic Systems

Supabase is an excellent platform for its target use case. Postgres, auth, storage, real-time subscriptions, and edge functions bundled into a working backend in minutes — at $25/month, it's exceptionally competitive for early-stage web applications. But for enterprise agentic systems on Databricks, there are two structural gaps that don't close with configuration: there is no Unity Catalog (agents operating on governed enterprise data need the same governance layer as the data itself), and there is no Lakehouse sync (analytical data still requires an ETL pipeline to reach Supabase, and Supabase data requires an ETL pipeline to reach the Lakehouse for monitoring and ML). Supabase asks you to build and maintain that bridge. Lakebase eliminates it.

Azure Database for PostgreSQL Flexible Server is a solid choice for traditional Azure-native transactional workloads. But compute and storage are coupled together — creating an isolated development copy of a production database requires duplicating the full storage volume, an operation measured in hours and charged by the gigabyte. There is no native database branching, no Lakehouse sync, and the governance model (Azure RBAC) is entirely separate from Unity Catalog. For teams building on Azure Databricks who want a single governance boundary across OLTP, OLAP, and ML — this means managing two different access control systems with no native bridge between them.

Azure Cosmos DB is purpose-built for globally distributed, multi-region, flexible-schema NoSQL workloads — a genuinely different problem from agentic state management. It's not PostgreSQL-compatible, which means LangGraph's Postgres checkpointer doesn't apply, standard psycopg2 drivers don't connect, and the document model doesn't naturally represent the relational shape of session checkpoints and handoff records. Cosmos DB is the right answer for a different question.

What I'd Rebuild in the Agentic Platform

With Lakebase available, the architecture for the three modules changes specifically:

Module 1 — Swarm Coordination:

Coordinator checkpoint store → Lakebase: thread state, current turn context, handoff records, confidence scores per routing decision. LangGraph Postgres checkpointer on Databricks Apps, authentication via Service Principal.
Agent episodic memory → Delta Lake (unchanged): cross-session analytical queries, SHAP analysis across sessions, Lakehouse Monitoring on reasoning patterns. Lakebase managed sync keeps Delta current automatically.

Module 2 — Ontology-Based Reasoning:

Ontology triples → Delta (unchanged): batch reads by the re-ranking gate, SQL queries for sub-graph retrieval. No change needed here — this is an OLAP access pattern.
Grounding cache → Lakebase: frequently accessed ontology sub-graphs cached in Postgres for sub-50ms retrieval during the agent's inner reasoning loop.

Module 3 — Causal Optimization:

Intervention results → Lakebase → managed Delta sync: causal engine writes intervention outcomes (high-frequency, transactional) to Lakebase. Sync pushes results to the Gold Delta layer for downstream analytics without custom ETL.
Causal DAG structure → Delta (unchanged): the DAG (edges, confidence scores, version history) is read by batch retraining jobs after PSI-triggered re-learning. Delta time travel for DAG versioning is already the right pattern here.

The net effect: short-term transactional operations at Postgres latency, long-term analytical operations at Delta scale, a single Unity Catalog governance layer across both, and zero custom ETL pipelines connecting them.

When Lakebase Isn't the Answer

A credible recommendation has boundaries. Lakebase is not the right choice when:

Your OLTP workload is genuinely independent of analytics — a standalone web app with no ML components or Lakehouse integration doesn't benefit from the co-location.
You need niche Postgres extensions not yet supported in Lakebase's managed environment (specialized GIS, custom time-series extensions).
You're building a consumer-facing mobile application where Supabase's bundled auth, storage, and real-time subscriptions are the actual product value.
You're not on Databricks. The Lakehouse integration is the primary differentiation — without it, Lakebase is a well-engineered managed Postgres, but not a category-defining choice.

The decision criterion is simple: how close is your agent workload to your Databricks analytics and ML stack? The closer it is, the more Lakebase earns its place.

The Larger Picture

Databricks started as the platform where you process and model data. Unity Catalog is the platform where you govern data. Lakebase makes it the platform where you run transactional applications on that data — without copying it, without bridging governance models, without maintaining a second operational stack alongside your analytics stack.

The 4x database creation stat isn't a curiosity. It's a forcing function. When agents provision databases at that rate, every architectural inefficiency — the manual provisioning, the ETL pipeline, the separate governance model — compounds at agent speed. Human architects designed those inefficiencies in; agents will expose them.

After rebuilding the Agentic Platform architecture mentally with Lakebase in place, the change is not additive — it's structural. It's the difference between three systems (OLTP, OLAP, ML) connected by pipelines you maintain, and one platform where those boundaries exist only in your mental model.

If this resonated, I'd welcome your thoughts in the comments — especially if you've hit the OLTP/OLAP boundary problem in your own agentic architectures. What did your workaround look like?

Gold Layer Design on Databricks — MERGE vs Overwrite, Partitioning, SCD Type 2 from SAP

savlahanish27 — Mon, 15 Jun 2026 11:29:39 GMT

Part 3 of my series on building an enterprise data platform on Databricks is up - this one cover Gold layer design.

The short version: Gold isn't just aggregated Silver. Silver maps to your source system. Gold maps to the business questions your consumers are actually asking - and those two things are almost never the same shape.

What's in the post:

MERGE vs overwrites for Gold writes, and the threshold where we switched (40min overwrite runs on vendor_balance at ~10M rows)
Partitioning strategy for financial tables: BUKRS+GJAHR for period aggregates, BUKRS alone for balances, no partition on dimensions
Z-ordering on LIFNR+MONAT for finance report query patterns
SCD Type 2 from SAP master data using a validity window at Gold
What doesn't belong in Gold — and the two days we spent auditing a table we eventually deleted
Full vendor_balance Gold table in PySpark with MERGE pattern

This is Part 3 of 5. Parts 1 and 2 covered Bronze ingestion (GoldenGate + Kafka + Structured Streaming alongside JDBC historical load) and Silver reconciliation. Part 4 is about why three-layer medallion wasn't enough and what we added.

Full post: Designing the Gold Layer on Databricks — What Belongs and What Doesn’t

Happy to answer questions on any of the decisions — there were a few where we went back and forth longer than we should have.