rakhidarshi, Databricks Employee

In today’s world, data is everything. It drives decisions, powers automation, and fuels innovation. But with all this power comes a challenge: ensuring that the data we rely on remains accurate, trustworthy, and, most importantly, healthy. As organizations expand and their data pipelines grow more complex, tracking how and when data changes can quickly become overwhelming.

Imagine you’re managing a large-scale data pipeline, and suddenly, something goes wrong. A downstream process fails, a report is generated with missing data, or performance drops unexpectedly. How do you pinpoint the issue when there’s so much data flowing through multiple layers of your pipeline? This is where Delta Table History comes in to save the day.

Delta Table History is a feature in Databricks that provides visibility into every change, update, and modification made to your data tables. It logs all operations, from simple inserts to complex merges, and stores crucial metadata like who made the change, when it was made, and what was modified. Think of it as a logbook of every action that has taken place on your Delta Table, allowing you to monitor, audit, and troubleshoot your data in ways you’ve never imagined.

But that’s not all. Beyond simply tracking data changes, Delta Table History unlocks the potential for advanced operational monitoring. By digging deeper into the metadata and transaction logs, you can uncover patterns, identify inefficiencies, and set up automated alerts to keep you informed in real time.

Let’s explore how you can leverage Delta table history to evaluate the health of your Delta tables.

 

A quick look at Delta Table History:

Now that you know what we’re aiming for, let’s first get to know the structure of Delta Table History. You can retrieve a wealth of information, including the operation, user, and timestamp for each transaction on a Delta table, by running the history command. The operations are returned in reverse chronological order.

For Spark SQL syntax details, see DESCRIBE HISTORY.

See the Delta Lake API documentation for Scala/Java/Python syntax details.
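For a quick look, you can run the history command directly in a notebook cell or the SQL editor; the three-level table name below is just a placeholder for your own table.

%sql
-- Full history, most recent operation first
DESCRIBE HISTORY my_catalog.my_schema.my_table;

-- Only the last 5 operations
DESCRIBE HISTORY my_catalog.my_schema.my_table LIMIT 5;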

A high-level schema of the Delta table history output is described below. For more details on the columns, especially operationMetrics, please refer to this documentation.

 

  • version (long): Table version generated by the operation.
  • timestamp (timestamp): When this version was committed.
  • userId (string): ID of the user that ran the operation.
  • userName (string): Name of the user that ran the operation.
  • operation (string): Name of the operation.
  • operationParameters (map): Parameters of the operation (for example, predicates).
  • job (struct): Details of the job that ran the operation.
  • notebook (struct): Details of the notebook from which the operation was run.
  • clusterId (string): ID of the cluster on which the operation ran.
  • readVersion (long): Version of the table that was read to perform the write operation.
  • isolationLevel (string): Isolation level used for this operation.
  • isBlindAppend (boolean): Whether this operation appended data.
  • operationMetrics (map): Metrics of the operation (for example, number of rows and files modified).
  • userMetadata (string): User-defined commit metadata, if it was specified.

With this information at your fingertips, you can uncover invaluable insights to monitor the health of your Delta table.

 

Derivable KPIs from Delta table history:

Now that we’ve unlocked the history of changes, how do we transform this data into actionable insights? This is where we define some KPIs (Key Performance Indicators) to help create measurable metrics. These KPIs will help you track how well your Delta Table is performing and whether it’s operating within expected parameters.

But before tracking these metrics, it is important to understand data retention and availability for table history. Please go through the documentation here.
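As a quick illustration (the table name is a placeholder and the retention value is only an example), you can inspect a table’s properties and, if needed, extend how long commit history is kept via the delta.logRetentionDuration property:

%sql
-- Inspect current table properties, including any retention overrides
SHOW TBLPROPERTIES my_catalog.my_schema.my_table;

-- Example only: keep commit history for 90 days instead of the 30-day default
ALTER TABLE my_catalog.my_schema.my_table
SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 90 days');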

Now, here are some key KPIs that can be derived from details present in Delta Table History:

Change Volume

This KPI monitors the number of rows affected by operations (e.g., INSERT, UPDATE, DELETE) on a Delta Table within a specific time period. It helps you track the table’s activity level, understand trends in the number of inserted, updated, and deleted records, and identify whether tuning is needed, for example, running an OPTIMIZE operation to reduce the number of small files and improve query performance.

This can be done simply using the query below:

%sql
WITH history AS (
 -- Get the Delta Table history using DESC HISTORY
 DESCRIBE HISTORY <table>
),
delta_history AS (
 -- Flatten the history results and extract the necessary columns
 SELECT
   operation,
   CAST(operationMetrics["numOutputRows"] AS BIGINT) AS numOutputRows,  -- Number of rows affected
   CAST(operationMetrics["numFiles"] AS BIGINT) AS numFiles,            -- Number of files written
   CAST(operationMetrics["bytesAdded"] AS BIGINT) AS bytesAdded,        -- Bytes added by the operation
   CAST(operationMetrics["bytesRemoved"] AS BIGINT) AS bytesRemoved,    -- Bytes removed by the operation
   CAST(timestamp AS DATE) AS operation_date                            -- Extract the date from the timestamp
 FROM
   history
)


-- Calculate the total rows affected, files impacted, and split by operation and date
SELECT
   operation_date,                   -- Group by date
   operation,                        -- Group by operation type (INSERT, UPDATE, DELETE, etc.)
   SUM(numOutputRows) AS total_rows_affected,  -- Total rows affected by operations per day
   SUM(numFiles) AS total_files_written,       -- Total files affected by operations per day
   SUM(bytesAdded) AS total_bytes_added,  -- Total bytes added by operations per day
   SUM(bytesRemoved) AS total_bytes_removed       -- Total bytes removed by operations per day
FROM
   delta_history
GROUP BY
   operation_date,                   -- Group by operation date
   operation                         -- Group by operation type (INSERT, UPDATE, DELETE)
ORDER BY
   operation_date DESC,              -- Sort by date (most recent first)
   operation; 

Data Insertion/Update Speed (Execution Time)

The Data Insertion/Update Speed KPI measures the time it takes to perform a data modification operation on a Delta Table, such as an INSERT, UPDATE, or MERGE. This metric helps assess the performance of data modification operations in your Delta Lake pipeline, providing insights into how efficiently data is being ingested or updated.

Apart from total execution time, you can also visualize scan time and rewrite time.

%sql
WITH history AS (
 -- Get the Delta Table history using DESC HISTORY
 DESCRIBE HISTORY <table>
),
delta_history AS (
 -- Flatten the history results and extract the necessary columns
 SELECT
   operation,
   operationMetrics["executionTimeMs"] AS execution_time_ms,  -- Total execution time in milliseconds
   operationMetrics["scanTimeMs"] AS scan_time_ms,            -- Time taken for scanning data
   operationMetrics["rewriteTimeMs"] AS rewrite_time_ms,      -- Time taken for rewriting data
   CAST(timestamp AS DATE) AS operation_date  -- Extract the date from the timestamp
 FROM
   history
 WHERE
   operation IN ('WRITE', 'UPDATE', 'MERGE')  -- Filter for data modification operations only
)


-- Calculate the execution time breakdown (Scan Time, Rewrite Time, and Other Time)
SELECT
   operation_date,                     -- Group by operation date
   operation,                          -- Group by operation type (WRITE, UPDATE, MERGE)
   AVG(scan_time_ms) AS avg_scan_time_ms,         -- Average scan time per operation
   AVG(rewrite_time_ms) AS avg_rewrite_time_ms,   -- Average rewrite time per operation
   AVG(execution_time_ms) AS avg_execution_time_ms,  -- Average total execution time
   MIN(scan_time_ms) AS min_scan_time_ms,  -- Minimum scan_time_ms
   MAX(scan_time_ms) AS max_scan_time_ms,   -- Maximum scan_time_ms
   MIN(rewrite_time_ms) AS min_rewrite_time_ms,  -- Minimum rewrite_time_ms
   MAX(rewrite_time_ms) AS max_rewrite_time_ms,   -- Maximum rewrite_time_ms
   MIN(execution_time_ms) AS min_execution_time_ms,  -- Minimum execution time
   MAX(execution_time_ms) AS max_execution_time_ms   -- Maximum execution time
FROM
   delta_history
GROUP BY
   operation_date,                     -- Group by operation date
   operation                           -- Group by operation type (WRITE, UPDATE, MERGE)
ORDER BY
   operation_date DESC,                -- Sort by date (most recent first)
   operation;                          -- Sort by operation type

You can also make this more interesting by comparing scan and rewrite times per hour, or by breaking them down by user, as sketched below.
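Here is a minimal sketch of the hourly variant, reusing the same CTE pattern (the table name is a placeholder):

%sql
WITH history AS (
 -- Get the Delta Table history using DESC HISTORY
 DESCRIBE HISTORY <table>
),
delta_history AS (
 SELECT
   operation,
   CAST(operationMetrics["scanTimeMs"] AS BIGINT) AS scan_time_ms,       -- Time taken for scanning data
   CAST(operationMetrics["rewriteTimeMs"] AS BIGINT) AS rewrite_time_ms, -- Time taken for rewriting data
   DATE_TRUNC('HOUR', timestamp) AS operation_hour                       -- Truncate the commit timestamp to the hour
 FROM
   history
 WHERE
   operation IN ('WRITE', 'UPDATE', 'MERGE')
)
SELECT
   operation_hour,
   operation,
   AVG(scan_time_ms) AS avg_scan_time_ms,
   AVG(rewrite_time_ms) AS avg_rewrite_time_ms
FROM
   delta_history
GROUP BY
   operation_hour,
   operation
ORDER BY
   operation_hour DESC,
   operation;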

Most Active Users/Jobs Impacting Table

Tracking the most active users or jobs impacting table updates provides valuable insights into the behavior and health of your table, especially in cases where multiple processes and stakeholders are making changes. By monitoring who is making the most updates and understanding the volume of changes being made, you can proactively address issues, detect inefficiencies, and ensure smooth operation across all users and processes involved.

This can be calculated with the below query:

%sql
-- Top Users (interactive runs, where no job is attached)
WITH history AS (
 -- Get the Delta Table history using DESC HISTORY
 DESCRIBE HISTORY <table>
),
delta_history AS (
 -- Flatten the history results and extract the necessary columns
 SELECT
   userName,                                                             -- The user making the modification
   job.jobId AS jobId,                                                   -- The job executing the operation (NULL for interactive runs)
   operation,                                                            -- The type of operation (WRITE, UPDATE, MERGE, etc.)
   CAST(operationMetrics["numOutputRows"] AS BIGINT) AS num_output_rows, -- Rows affected by the operation
   CAST(timestamp AS DATE) AS operation_date                             -- Extract the date from the timestamp
 FROM
   history
 WHERE
   operation IN ('INSERT', 'UPDATE', 'MERGE', 'WRITE')  -- Filter for data modification operations only
)
SELECT
   userName,                            -- Group by user
   COUNT(*) AS num_operations,          -- Count the number of operations performed by the user
   SUM(num_output_rows) AS total_rows_affected  -- Sum of rows affected by the user
FROM
   delta_history
WHERE
   jobId IS NULL  -- Only consider operations with no associated job
GROUP BY
   userName
ORDER BY
   num_operations DESC,                 -- Sort by number of operations (most active first)
   total_rows_affected DESC             -- Then by the number of rows affected
LIMIT 10   -- Limit the results to the top 10 users
;


-- Top Jobs (where jobId is NOT NULL). A WITH clause only applies to the
-- statement that follows it, so the CTEs are repeated for this second query.
WITH history AS (
 DESCRIBE HISTORY <table>
),
delta_history AS (
 SELECT
   job.jobId AS jobId,
   CAST(operationMetrics["numOutputRows"] AS BIGINT) AS num_output_rows
 FROM
   history
 WHERE
   operation IN ('INSERT', 'UPDATE', 'MERGE', 'WRITE')
)
SELECT
   jobId,                               -- Group by job
   COUNT(*) AS num_operations,          -- Count the number of operations performed by the job
   SUM(num_output_rows) AS total_rows_affected  -- Sum of rows affected by the job
FROM
   delta_history
WHERE
   jobId IS NOT NULL  -- Only consider operations triggered by a job
GROUP BY
   jobId
ORDER BY
   num_operations DESC,                 -- Sort by number of operations (most active first)
   total_rows_affected DESC             -- Then by the number of rows affected
LIMIT 10   -- Limit the results to the top 10 jobs


Other important KPIs that can be derived include the following:

Update volume by Users/Jobs

This KPI measures the volume of data updates made by individual users or specific jobs within a Delta Table over a defined period. By tracking this KPI, you can gain insights into which users or jobs are driving the most data modifications.
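A minimal sketch, reusing the same CTE pattern (the table name is a placeholder; changes are attributed to a job when one is present, otherwise to the user):

%sql
WITH history AS (
 DESCRIBE HISTORY <table>
),
delta_history AS (
 SELECT
   COALESCE(CAST(job.jobId AS STRING), userName) AS actor,               -- Attribute the change to the job if present, otherwise to the user
   CAST(operationMetrics["numOutputRows"] AS BIGINT) AS num_output_rows, -- Rows affected by the operation
   CAST(timestamp AS DATE) AS operation_date
 FROM
   history
 WHERE
   operation IN ('WRITE', 'UPDATE', 'MERGE')
)
SELECT
   operation_date,
   actor,
   SUM(num_output_rows) AS rows_changed
FROM
   delta_history
GROUP BY
   operation_date,
   actor
ORDER BY
   operation_date DESC,
   rows_changed DESC;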

Transaction Throughput (Operations Per Time Period)

The Transaction Throughput KPI measures the number of operations performed on a Delta Table within a specific time period. It helps assess the overall activity level of the table and the efficiency of data processing workflows.
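For instance, a simple sketch that counts commits per day and per operation type (the table name is a placeholder; you can switch to hourly buckets for busier tables):

%sql
WITH history AS (
 DESCRIBE HISTORY <table>
),
delta_history AS (
 SELECT
   operation,
   CAST(timestamp AS DATE) AS operation_date
 FROM
   history
)
SELECT
   operation_date,
   operation,
   COUNT(*) AS num_operations    -- Number of commits per day and operation type
FROM
   delta_history
GROUP BY
   operation_date,
   operation
ORDER BY
   operation_date DESC,
   operation;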

Data Growth Rate (Update Size or Number of Files)

This KPI measures the increase in table size or the number of files over time, providing insight into how quickly your Delta Table is growing. By tracking this metric, you can monitor both the volume of data being added and the rate at which new data files are created.
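A sketch of this KPI derived from the bytesAdded, bytesRemoved, and numFiles metrics is shown below (the table name is a placeholder; not every operation reports all of these metrics, so missing values are treated as zero):

%sql
WITH history AS (
 DESCRIBE HISTORY <table>
),
delta_history AS (
 SELECT
   CAST(timestamp AS DATE) AS operation_date,
   CAST(operationMetrics["bytesAdded"] AS BIGINT) AS bytes_added,
   CAST(operationMetrics["bytesRemoved"] AS BIGINT) AS bytes_removed,
   CAST(operationMetrics["numFiles"] AS BIGINT) AS num_files
 FROM
   history
)
SELECT
   operation_date,
   COALESCE(SUM(bytes_added), 0) - COALESCE(SUM(bytes_removed), 0) AS net_bytes_growth,  -- Net storage growth per day
   COALESCE(SUM(num_files), 0) AS files_written                                          -- Files written per day
FROM
   delta_history
GROUP BY
   operation_date
ORDER BY
   operation_date DESC;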

Delta Table Versioning and Schema Evolution

This KPI measures the effectiveness and efficiency of managing changes to the structure and historical states of your Delta Table. By tracking this KPI, you can monitor how frequently the table schema evolves—such as the addition of new columns or data type changes—and ensure that schema updates are handled smoothly.

These updates can be identified using the ‘operation’ column by checking for ALTER TABLE operations such as “ADD COLUMNS”.

Note: This is only valid when tables are updated using ALTER TABLE commands. If schema changes are made through schema evolution during merges, they cannot be tracked in the history table.
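A rough sketch of this check is below; the exact operation strings can vary across runtime versions, so treat the filter list as an assumption and verify it against your own DESCRIBE HISTORY output.

%sql
WITH history AS (
 DESCRIBE HISTORY <table>
)
SELECT
   CAST(timestamp AS DATE) AS operation_date,
   operation,
   COUNT(*) AS num_schema_changes
FROM
   history
WHERE
   operation IN ('ADD COLUMNS', 'CHANGE COLUMN', 'RENAME COLUMN', 'REPLACE COLUMNS')  -- Assumed operation names; verify against your table's history
GROUP BY
   CAST(timestamp AS DATE),
   operation
ORDER BY
   operation_date DESC,
   operation;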

Data Retention Compliance (File Cleanliness)

The Data Retention Compliance (File Cleanliness) KPI measures how effectively data is managed in accordance with retention policies, ensuring that obsolete or unnecessary files are appropriately dealt with.
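One simple way to approach this from the history table is to check how recently, and how often, maintenance operations have run (the table name is a placeholder; VACUUM appears in history as separate start and end entries):

%sql
WITH history AS (
 DESCRIBE HISTORY <table>
)
SELECT
   operation,
   MAX(timestamp) AS last_run,    -- Most recent maintenance run
   COUNT(*) AS total_runs         -- How often it has run within the retained history
FROM
   history
WHERE
   operation IN ('OPTIMIZE', 'VACUUM START', 'VACUUM END')
GROUP BY
   operation;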

These are some of the KPIs that can be derived; of course, you can customize them and create new KPIs to suit your needs.

 

Setting up Monitoring dashboards:


Once you have all these KPIs, it’s time to turn them into real-time monitoring tools. In Databricks, you can set up dashboards that visually display these KPIs, allowing you to monitor the health of your Delta Tables at a glance. Imagine having a dashboard that shows transaction throughput, execution times, and change volumes all in one place. Some simple examples could include:

  • A "Pipeline Health" dashboard could display KPIs such as change volume and execution times.
  • A "Data Growth and Optimization" dashboard could focus on data growth, the number of files, and the frequency of optimization or vacuum operations.


To create an AI/BI dashboard, please refer to the Databricks documentation. You can also enable Genie spaces to help answer ad hoc questions about the datasets you add.

 

Setting up Alerts:


Setting up alerts helps table owners stay on top of any unexpected events that might crop up. Suppose you’re managing a Delta Table that is critical to your data pipeline. This table experiences frequent transactions, and its performance, data growth, and operational efficiency directly impact downstream processes. The table is essential for running business-critical reports and feeding other systems.

In such cases, to ensure the health of the Delta Table, it’s essential to set up automated alerts when these KPIs exceed predefined thresholds. For example, you may want to be alerted when:

  • The transaction throughput suddenly spikes beyond a threshold (e.g., more than 100 operations per minute), indicating a potential system overload.
  • Execution time exceeds a threshold (e.g., more than 2 minutes per operation), suggesting a possible performance issue.
  • Change volume exceeds a certain limit (e.g., more than 1 million rows updated in an hour), pointing to unusually large data updates that may need optimization.
  • Data growth rate accelerates unexpectedly (e.g., table size growing by 20% in a week), suggesting that data might be accumulating faster than expected.
  • File cleanliness becomes an issue, for example, when the number of files written grows sharply while the number of rows updated stays relatively small, indicating that data cleanup processes are failing.

These alerts allow you to act proactively, preventing data issues from escalating into critical problems.

To learn more about setting up alerts, please refer to the Databricks documentation.
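As an illustration, a query like the following could back a Databricks SQL alert on execution time; the table name and the one-hour window are placeholders, and the actual threshold (for example, avg_execution_time_ms > 120000) is configured on the alert itself:

%sql
WITH history AS (
 DESCRIBE HISTORY <table>
),
recent AS (
 SELECT
   CAST(operationMetrics["executionTimeMs"] AS BIGINT) AS execution_time_ms
 FROM
   history
 WHERE
   operation IN ('WRITE', 'UPDATE', 'MERGE')
   AND timestamp >= current_timestamp() - INTERVAL 1 HOUR
)
SELECT
   COUNT(*) AS operations_last_hour,
   AVG(execution_time_ms) AS avg_execution_time_ms
FROM
   recent;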

 

Anomaly Detection - Lakehouse monitoring

While custom KPIs and alerts derived from Delta Table History offer deep control and flexibility, Databricks also provides a built-in anomaly detection feature as part of its Lakehouse Monitoring suite. This feature, currently in beta, helps you monitor table-level data quality issues such as freshness and completeness, without requiring any modifications to your data pipelines or additional ETL overhead.

🔎 What It Detects

Databricks' anomaly detection automatically scans tables in a Unity Catalog schema and flags:

  • Freshness Anomalies: When a table has not been updated within the expected time window based on historical trends.
  • Completeness Anomalies: When the number of rows written is significantly lower than predicted, indicating a potential data drop or job failure.

To learn more about this offering, please refer to Databricks documentation.

 

Conclusion: A Healthier Data Pipeline Awaits

By leveraging Delta Table History and tracking KPIs, you gain control over the operational health of your Delta Tables. With the right tools and insights, you can ensure that your data pipeline runs smoothly, efficiently, and without surprises. From tracking data changes to measuring performance and identifying issues early, Delta Table History offers a robust solution for any data-driven organization.

So, take charge of your data today and make your Delta Tables not just reliable but resilient. The road to operational excellence begins with understanding what’s happening behind the scenes, and Delta Table History is your perfect guide.