LingeshK, Databricks Employee

Introduction - From Reactive Firefighting to Proactive Monitoring

In many industries, data pipelines power critical decisions, from fraud detection to inventory management. When these pipelines fail silently or degrade gradually, the result can be misplaced customer trust, compliance violations, and operational bottlenecks.

Monitoring data workloads can be challenging if not done right, especially when issues are difficult to detect and resolve. Without adequate monitoring, teams may face unexpected downtime and spend significant time troubleshooting problems. Delta Live Tables (DLT) provides built-in monitoring features and integrates with Databricks’ native tools, making it easier to track pipeline health, identify bottlenecks, and respond quickly to errors. This approach helps ensure that data remains reliable and that pipelines run smoothly with minimal manual intervention.

Unlike fragmented approaches that force teams to cobble together logs, DLT bakes monitoring into the following layers:

  • Centralized data quality rules: This lets the team embed checks like “transaction amounts must be positive” directly into pipeline code, ensuring validation happens at the source.

  • Streaming metrics: DLT pipelines can now monitor streaming sources such as Apache Kafka and Auto Loader. Streaming metrics such as backlog bytes, records, seconds, and files are displayed in real time within the DLT UI.

  • Automatic lineage mapping: This visually traces errors back to their origin.

  • Custom monitoring: New capabilities allow you to define custom actions based on specific pipeline events, which is particularly useful for alerting on operational thresholds and other bespoke monitoring needs. In addition to these native capabilities, teams can still leverage traditional Spark mechanisms such as StreamingQueryListener, cluster logs, and driver/executor logs for deeper insights, troubleshooting, and historical auditing (see the listener sketch after this list).

  • Self-correcting workflows: DLT can automatically retry failed tasks or trigger alerts for unresolved issues, minimizing manual firefighting. If problems arise, DLT automatically maps errors back to their origin, for example, tracing a sudden spike in invalid customer IDs to a specific transformation step. For common hiccups like brief network outages, the system quietly retries tasks up to five times, alerting engineers only if problems persist.
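For the traditional Spark route mentioned in the custom monitoring point above, a StreamingQueryListener can surface per-batch metrics from any Structured Streaming workload you run yourself. The snippet below is a minimal sketch, assuming you control the SparkSession (for example, in a supporting notebook or job rather than the managed DLT run itself); the class name and print-based logging are purely illustrative.

from pyspark.sql.streaming import StreamingQueryListener

class SimpleProgressListener(StreamingQueryListener):
    # Called once when a streaming query starts
    def onQueryStarted(self, event):
        print(f"Query started: id={event.id}, name={event.name}")

    # Called after every micro-batch with the standard progress payload
    def onQueryProgress(self, event):
        progress = event.progress
        print(f"Batch {progress.batchId}: {progress.numInputRows} input rows")

    # Called when a query is idle (no new data to process)
    def onQueryIdle(self, event):
        pass

    # Called when a query stops, with the exception (if any) attached
    def onQueryTerminated(self, event):
        if event.exception:
            print(f"Query {event.id} failed: {event.exception}")

spark.streams.addListener(SimpleProgressListener())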

In this article, we explore performance-focused monitoring best practices, highlighted through a real-life deployment example.

 

1. Monitoring Techniques with DLT Pipelines

As mentioned above, DLT has built-in monitoring, which includes DLT UI quality metrics, structured event logs, and system-level metadata tables. This allows data engineers to track pipeline health in a scalable way.

1.1 DLT Monitoring Features

Consider the following DLT Python logic that moves sample quote data from Bronze to Gold. We will use this example to guide our discussion and explore how the DLT UI and DLT expectations can help monitor the quality of the data being processed.

Quotes Processing Medallion Architecture

Below is the code block to read the data from the source table, main.pricing_data.quotes:

import dlt
from pyspark.sql.functions import *


@dlt.table(
    comment="Raw quotes data",
    table_properties={"quality": "bronze"})
def bronze_quotes():
    return spark.sql("select * from main.pricing_data.quotes")

The next block of code is used to transform the bronze data into silver and gold layers:

  • The silver hop adds expectations on the bronze data, which tell DLT the level of quality the data flowing downstream must meet. If a record violates an expectation, we can control what happens to it.
  • Finally, in the gold layer, we perform aggregations and make the data publish-ready.
# Silver Layer Table
@dlt.table(table_properties={"quality": "silver"})
@dlt.expect_or_drop("valid_pricing", "Quote_Price < 3900")
@dlt.expect_or_drop("valid_driver_age", "Driver_Age BETWEEN 17 AND 90")
def silver_quotes():
    return (
        dlt.read("bronze_quotes")
        .dropDuplicates(["ID"])
        .filter(
            (col("Annual_Mileage") > 0) &
            (col("Credit_Score").between(300, 850)) &
            (col("Vehicle_Type").isin(["Car", "Truck", "Motorbike"])))
        .withColumn("Quote_Price", col("Quote_Price").cast("integer"))
        .withColumn("Driver_Age", col("Driver_Age").cast("integer"))
        .withColumn("Date", col("Date").cast("date")))

# Gold Layer Table
@dlt.table(table_properties={"quality": "gold"})
def gold_vehicle_quotes():
    return (
        dlt.read("silver_quotes")
        .groupBy("Vehicle_Type")
        .agg(
            avg("Quote_Price").alias("Average_Premium"),
            count("ID").alias("Total_Policies")))

1.1.1 UI Quality Metrics

Observability in a DLT pipeline starts with its lineage diagram, which provides visibility into how data flows through the pipeline and supports impact analysis.

When the pipeline is executed, the DLT UI displays the corresponding DAG, which serves as the first level of monitoring. This view provides an overview of the data volume being processed, particularly in the Silver layer. In this example, approximately 42k records were processed successfully, while around 8.3k records failed the quality checks.

DLT Pipeline DAG - Quotes Processing

If we click the silver_quotes materialized view, we can open the “Data Quality” tab, which shows more granular information: the number of records written to the target, the number dropped, and the expectations (valid_driver_age and valid_pricing) that drove these metrics.

Data Quality Statistics on silver_quotes table

1.1.2 DLT Expectations

DLT enforces data quality through declarative expectations that validate records during pipeline execution. The DLT UI's Data Quality tab displays real-time metrics for constraints defined via the expect (retain), expect_or_drop, and expect_or_fail operators. For example, on the Silver materialized view silver_quotes, we used the @dlt.expect_or_drop decorator to drop records where the driver’s age falls outside the valid range for obtaining a quotation and to filter out unusually high quote prices.

# valid pricing expectation checks for the value of quote price
@dlt.expect_or_drop("valid_pricing", "Quote_Price < 3900")

# valid driver age expectation checks for the age of the driver
@dlt.expect_or_drop("valid_driver_age", "Driver_Age BETWEEN 17 AND 90")

More information and sample DLT expectations can be found in the official Databricks documentation: Manage data quality with pipeline expectations
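For completeness, the sketch below illustrates the other expectation flavours on a hypothetical table built from the same bronze data: expect logs violations but retains the records, expect_or_fail stops the update on any violation, and expect_all_or_drop groups several drop rules into one decorator. The table name and rule set here are illustrative and not part of the pipeline above.

import dlt

# Illustrative only: "quotes_expectation_demo" is a hypothetical table name.
@dlt.table(comment="Demonstrates the retain, fail, and grouped-drop expectation flavours")
@dlt.expect("has_quote_date", "Date IS NOT NULL")        # violation is logged, record is kept
@dlt.expect_or_fail("valid_id", "ID IS NOT NULL")        # any violation fails the update
@dlt.expect_all_or_drop({                                # several drop rules in one decorator
    "valid_mileage": "Annual_Mileage > 0",
    "valid_credit": "Credit_Score BETWEEN 300 AND 850",
})
def quotes_expectation_demo():
    return dlt.read("bronze_quotes")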

1.2 Event Log

Each DLT pipeline automatically creates an event log that captures audit information, data quality check results, pipeline progress, and data lineage.

There are two ways to access this DLT event log: through the default publishing mode or the legacy publishing mode.

With the default publishing mode, the event log can be published as a table in Unity Catalog by specifying a target catalog, schema, and table name. If you’re building your pipeline via the Databricks UI, you can set this under “Advanced settings” in the pipeline configuration.

ETL Pipeline Create Configuration

If you are using a JSON or API-based configuration, you can add the following block to your pipeline settings:

"event_log": {
    "catalog": "main",
    "schema": "monitoring",
    "name": "event_log_quotes_pipeline"
}

This configuration publishes structured pipeline events to the specified Unity Catalog catalog and schema, making them queryable like any other Delta table. The example below filters for error-level events from the past 24 hours, helping you quickly identify recent pipeline failures, expectation violations, or operational issues.

SELECT timestamp, event_type, message
FROM main.monitoring.event_log_quotes_pipeline
WHERE level = 'ERROR'
  AND timestamp > current_timestamp() - INTERVAL 24 HOURS
ORDER BY timestamp DESC;

With the legacy publishing mode, you can use the event_log table-valued function (TVF) to fetch the event log for a pipeline by passing the pipeline ID or a table name to the TVF. For example, to retrieve the event log records for the pipeline with ID "19c78331-2ea5-1973-a1b2-4dbef8b2198c", you can run the following:

SELECT * FROM event_log("19c78331-2ea5-1973-a1b2-4dbef8b2198c")

Your pipeline is in legacy publishing mode if you see the following in your pipeline UI settings:

[Screenshot: pipeline UI settings indicating legacy publishing mode]

Additionally, for the legacy publishing mode, keep the following limitations in mind:

  • The event_log TVF can be called only by the pipeline owner.
  • You cannot use the event_log table-valued function in a pipeline or query to access the event logs of multiple pipelines.
  • You cannot share a view created over the event_log table-valued function with other users.

The diagram below illustrates how DLT in Databricks offers unified observability by combining visual pipeline monitoring, configurable event logging, and structured audit trails. A sample DLT pipeline with Bronze, Silver, and Gold layers shows real-time execution status and data volumes. The event log configuration allows users to define where logs are stored, such as in Unity Catalog under main.monitoring.event_log_sales_pipeline. The DLT UI displays a live stream of pipeline events, while the underlying Delta table captures structured log data including event type, timestamp, and user actions. This integration enables both immediate visual feedback and advanced programmatic monitoring, streamlining debugging, auditing, and alerting workflows.

DLT Monitoring Event Configuration

1.2.1 Dissecting the Event Logs for Non-Technical Personas

Navigating to a Genie space via the “Ask Genie” option in the Sample data pane enables users to interact with their datasets using natural language queries. We can ask questions such as the one shown in the snippet below:

Sales Event Log Table Analysis via Genie Space

Genie dynamically adjusts its responses according to user feedback, which can be verified via the ‘Show code’ option.

1.2.2 Dissecting the Event Logs using the SQL Editor

Users can also use the SQL Editor to write more granular queries that can power AI/BI dashboards.

Consider this scenario: the user wants to count the passing and failing records for all the data quality checks defined in the DLT code. The following query is one way of achieving this:

-- Create or replace a temporary view named latest_update that selects the most recent 'create_update' event
CREATE OR REPLACE TEMP VIEW latest_update AS SELECT origin.update_id AS id FROM `event_log_sales_pipeline`
WHERE event_type = 'create_update' ORDER BY timestamp DESC LIMIT 1;


-- Select dataset, expectation name, and the sum of passing and failing records for each expectation
SELECT
 row_expectations.dataset as dataset,
 row_expectations.name as expectation,
 SUM(row_expectations.passed_records) as passing_records,
 SUM(row_expectations.failed_records) as failing_records
FROM
 (
   -- Explode the JSON array of expectations into individual rows
   SELECT
     explode(
       from_json(
         details :flow_progress :data_quality :expectations,
         "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
       )
     ) row_expectations
   FROM
     `event_log_sales_pipeline`,
     latest_update
   WHERE
     event_type = 'flow_progress'
     AND origin.update_id = latest_update.id
 )
GROUP BY
 row_expectations.dataset,
 row_expectations.name;

This SQL query leverages the custom event log table (event_log_sales_pipeline) in the metastore to derive data quality metrics by analyzing the latest pipeline updates. Specifically:

  • It extracts and aggregates metrics related to expectations, such as passing_records and failing_records counts, from the following JSON field:
details:flow_progress:data_quality:expectations
  • It ensures analysis is focused on the most recent pipeline update by filtering using event_type = 'flow_progress' and matching update_id with the latest value in the latest_update temporary view.
  • Results are grouped by the dataset and expectation name, providing a concise summary of data quality performance per dataset.

This is the result generated from this query:

Pass & Fail Record Count Split on silver_quotes table

In another scenario, if your DLT pipeline stores its output in the Hive metastore (HMS), you can query the event log Delta table by pointing to the event log folder (event logs are captured and stored in the system/events folder under the storage location of the DLT pipeline).

SELECT * FROM delta.`<event_log_path>`

If the user wants to query the top 10 most frequent exception class names, they can run the following command:

-- Select the top 10 most frequent exception class names from the exploded error.exceptions array
SELECT exception.class_name, COUNT(*) AS occurrence
FROM main.monitoring.event_log_sales_pipeline
LATERAL VIEW EXPLODE(error.exceptions) AS exception
GROUP BY exception.class_name
ORDER BY occurrence DESC
LIMIT 10; -- you can change this value to get more results

This is a sample result from the query above:

Frequency of Error Classes generated from Event Log Table

To measure the average interval between events for a specific pipeline to monitor activity frequency or detect lulls, we can write the following query:

WITH event_times AS (
 -- Selects timestamps from the event log for the 'Quotes Processing Pipeline' and orders them chronologically
 SELECT timestamp
 FROM main.monitoring.event_log_sales_pipeline
 WHERE origin.pipeline_name = 'Quotes Processing Pipeline'
 ORDER BY timestamp
),
diffs AS (
 -- Calculates the difference between consecutive timestamps using the LEAD function
 SELECT timestamp, LEAD(timestamp) OVER (ORDER BY timestamp) AS next_timestamp
 FROM event_times
)
-- Computes the average time difference between consecutive events in seconds
SELECT CONCAT(ROUND(AVG(UNIX_TIMESTAMP(next_timestamp) - UNIX_TIMESTAMP(timestamp)), 2), ' seconds') AS avg_seconds_between_events
FROM diffs
WHERE next_timestamp IS NOT NULL;

This will yield a result of the average time interval between all the events that are getting logged:

Average Latency (seconds) between two consecutive events

Finally, we can make use of the event log’s maturity_level column, which indicates the stability of the event schema:

-- This query selects the maturity level of events, counts the total number of events for each maturity level,
-- and sums the number of fatal errors for each maturity level. The results are grouped by maturity level and
-- ordered by the total number of events in descending order.
SELECT maturity_level, COUNT(*) AS total_events,
      SUM(CASE WHEN error.fatal THEN 1 ELSE 0 END) AS fatal_errors
FROM main.monitoring.event_log_sales_pipeline
GROUP BY maturity_level
ORDER BY total_events DESC;

This will yield the following result:

[Query result: event and fatal error counts grouped by maturity_level, showing 65 STABLE and 17 EVOLVING events]

Maturity levels, introduced in DLT release 2022.37, indicate the stability of event schemas within the pipeline's event log. STABLE events have guaranteed schemas that will not change across DLT versions, making them dependable anchors for monitoring solutions. In contrast, EVOLVING events have schemas that may change in future releases.

The higher proportion of STABLE events (65) versus EVOLVING events (17) suggests this pipeline has largely settled into reliable event patterns, which is preferable for production environments.

Utilizing the built-in event_log table in DLT is helpful for monitoring and identifying data quality trends in the pipelines, offering insights into the effectiveness of the expectations defined within the system; one way to surface such trends is sketched below.
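The query below rolls expectation failures up by day across all pipeline updates, wrapped in spark.sql so it can run from the same notebook. It assumes the event log table and the details:flow_progress JSON layout used in the earlier query; adjust the names to your own pipeline.

# Sketch: daily failed-record counts per expectation across all pipeline updates.
# Assumes the event log is published to main.monitoring.event_log_sales_pipeline.
dq_trend = spark.sql("""
  SELECT
    date(timestamp)                      AS event_date,
    row_expectations.name                AS expectation,
    SUM(row_expectations.failed_records) AS failing_records
  FROM (
    SELECT
      timestamp,
      explode(
        from_json(
          details :flow_progress :data_quality :expectations,
          "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
        )
      ) AS row_expectations
    FROM main.monitoring.event_log_sales_pipeline
    WHERE event_type = 'flow_progress'
  )
  GROUP BY date(timestamp), row_expectations.name
  ORDER BY event_date, expectation
""")
display(dq_trend)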

1.2.3 Building Dashboards from the DLT Event Log Table

You can then embed these queries into the AI/BI dashboard as custom SQL queries in the “Data” pane. One simple example is to set the dataset to the custom event log table and use either the assistant or the visualization widget toolbar to create appropriate visualizations.

SELECT * FROM main.monitoring.event_log_sales_pipeline;

Below is an example of how the visualizations are built on top of the dataset that captures all the records of the main.monitoring.event_log_sales_pipeline event log table.

Event Log Sales Pipeline Dashboard

1.3 System Tables for Monitoring

Databricks system tables offer structured metadata across pipelines, jobs, and clusters, enabling easy alerting using SQL. These tables can be queried directly or integrated with dashboards, alerts, and workflows to automate monitoring and incident response.
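As a hedged illustration, the query below estimates daily DBU consumption for a single pipeline from the system.billing.usage table. It assumes system tables are enabled in your workspace and that usage_metadata.dlt_pipeline_id is populated for your workloads; replace the placeholder with your own pipeline ID.

# Sketch: daily DBU usage for one pipeline over the last 30 days.
# Assumes system tables are enabled; <your-pipeline-id> is a placeholder to replace.
pipeline_usage = spark.sql("""
  SELECT
    usage_date,
    sku_name,
    SUM(usage_quantity) AS dbus
  FROM system.billing.usage
  WHERE usage_metadata.dlt_pipeline_id = '<your-pipeline-id>'
    AND usage_date >= date_sub(current_date(), 30)
  GROUP BY usage_date, sku_name
  ORDER BY usage_date
""")
display(pipeline_usage)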

 

2. From Metrics to Monitoring: Internal Observability Tools

Once your event logs are available in Unity Catalog (see the previous section for setup and example queries), you can query and display them with visual dashboards, alerts, streaming tables, and third-party tools.


2.1 Monitoring Metrics within Databricks

Below are three practical ways to bring your metrics into actionable monitoring.

2.1.1 Visualise with AI/BI Databricks Dashboards

You can turn your SQL queries (e.g., pipeline error counts or failure statuses) into AI/BI dashboards, where users can build charts and graphs to analyze the pipelines. Check out our dedicated DBdemos page to build a dashboard similar to the one shown below.

[Example AI/BI dashboard built from DLT pipeline metrics]
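As one hedged example of a dashboard dataset, the query below counts error-level events per day from the event log table configured earlier; the inner SQL can be pasted into the dashboard’s “Data” pane and charted as error_events over event_date.

# Sketch of a dashboard dataset: error-level events per day over the last 7 days.
errors_per_day = spark.sql("""
  SELECT
    date(timestamp) AS event_date,
    COUNT(*)        AS error_events
  FROM main.monitoring.event_log_sales_pipeline
  WHERE level = 'ERROR'
    AND timestamp > current_timestamp() - INTERVAL 7 DAYS
  GROUP BY date(timestamp)
  ORDER BY event_date
""")
display(errors_per_day)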

 

2.1.2 Set up SQL Alerts for Proactive Monitoring

Another helpful Databricks feature is SQL alerts. You can set up these alerts to monitor pipeline discrepancies (e.g., a table count exceeding the expected limit) and send notifications via email, Slack, or webhooks.
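A minimal sketch of an alert condition is shown below: the inner SQL counts fatal errors from the event log over the last hour, and the same statement can be attached to a SQL alert that triggers when fatal_error_count is greater than zero. The table name assumes the event log location configured earlier.

# Sketch of an alert condition: fatal errors recorded in the last hour.
# Attach the inner SQL to a SQL alert and trigger when fatal_error_count > 0.
alert_check = spark.sql("""
  SELECT COUNT(*) AS fatal_error_count
  FROM main.monitoring.event_log_sales_pipeline
  WHERE error.fatal = true
    AND timestamp > current_timestamp() - INTERVAL 1 HOUR
""")
display(alert_check)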

 

2.1.3 Build Streaming Tables for Monitoring

For continuous monitoring, you can convert the event log into a streaming table. This is especially useful when you want to detect and respond to issues, such as failed jobs and error messages, as soon as they appear in the logs. 

Below is an example script showing how to create an event log streaming table from a notebook:

# Read error-level events from the event log table as a stream
df = (spark.readStream
      .table("main.monitoring.event_log_sales_pipeline")
      .filter("level = 'ERROR'"))

# Append the filtered error events to a dedicated monitoring table
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/Volumes/main/monitoring/event_log_errors_stream_checkpoint")
   .outputMode("append")
   .table("main.monitoring.event_log_errors_stream"))

Instead of directly querying the raw event log table repeatedly, which can be inefficient, you can create a dedicated streaming table with a filtered subset (e.g., error-level events). This has several advantages:

  • Improved performance: Queries on smaller, filtered tables are faster and more efficient.
  • Data isolation: Only the relevant data (e.g., errors) is shared, keeping the whole event log private.

Note: Although this is a streaming table, updates will only occur after the pipeline has finished running. This is because the DLT event log is written at the end of each pipeline run, not continuously during execution.

 

3. Example Scenario

A major retail chain implemented DLT to process its daily sales data across thousands of stores nationwide. Their pipeline followed the medallion architecture shown in the diagram: ingesting raw sales data from cloud storage and message queues into Bronze tables, transforming it into validated customer and order records in Silver, and finally creating aggregated daily sales views in the Gold layer for business consumption.

Initially, the pipeline operated smoothly, but as data volumes grew with the business expansion, the operations team began noticing concerning patterns. Daily pipeline runs were increasingly failing during peak sales periods, particularly following weekend promotions when transaction volumes spiked significantly.

The team investigated using DLT’s built-in monitoring capabilities. Analyzing the DLT event log uncovered a correlation between cost spikes and excessive autoscaling behaviour in their cluster configuration; a query over the event log’s cluster_resources events, along the lines of the sketch below, is one way to surface that scaling pattern.
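The field names under details:cluster_resources in this sketch follow the documented event log schema but should be verified against your DLT release, and the table name matches the event log configuration used throughout this article.

# Sketch: autoscaling behaviour from cluster_resources events in the event log.
# Field names (num_executors, avg_num_queued_tasks, avg_task_slot_utilization)
# follow the documented schema but may vary by release; verify before relying on them.
autoscale_events = spark.sql("""
  SELECT
    timestamp,
    CAST(details :cluster_resources.num_executors AS INT)                AS current_executors,
    CAST(details :cluster_resources.avg_num_queued_tasks AS DOUBLE)      AS avg_queued_tasks,
    CAST(details :cluster_resources.avg_task_slot_utilization AS DOUBLE) AS avg_task_slot_utilization
  FROM main.monitoring.event_log_sales_pipeline
  WHERE event_type = 'cluster_resources'
  ORDER BY timestamp
""")
display(autoscale_events)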

High Level End-to-End flow of DLT pipeline Monitoring Process

 

4. Conclusion

Don’t let monitoring be an afterthought. Leverage our observability capabilities to monitor pipeline performance and data quality anomalies. 

🚀 Take action: Start implementing pipeline monitoring, leverage the event log, and integrate with your observability stack to unlock the full power of DLT today!

Check out the Databricks Community for even more exciting blogs about the latest features of Databricks.

 

5. Resources