Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Clarification on how Streaming Backlog Duration & Records are calculated

saicharandeepb
New Contributor II

Hi all,

I’m building a dashboard for streaming observability, and I’m trying to understand how some of the backlog metrics shown in Databricks are actually calculated.

In particular, I’m looking at:

  • Streaming Backlog (records): described as the expected maximum offset lag across all streams.

  • Streaming Backlog (duration): described as the expected maximum consumer delay across all streams.

From the docs, I understand the definitions, but for my dashboard I’d like to know:

  • How exactly are these values computed internally in Databricks?
  • Are they based on offsets, event-time timestamps, processing rates, or a combination?
  • Do they vary by source (e.g., Auto Loader vs. Delta)?

Any clarification or additional detail would really help me design accurate monitoring and alerting in my dashboard.

Thanks in advance.

1 REPLY

WiliamRosa
New Contributor II

Hi @saicharandeepb,
Here’s what I found after reproducing this in Databricks (Auto Loader and the rate source) and inspecting lastProgress.

TL;DR

Streaming Backlog (records) = how many input records are still discoverable but not yet committed by the consumer.

Streaming Backlog (duration) = an estimate of time to drain that backlog; in practice it depends on the current processing rate.

The exact mechanics vary by source (Kafka vs. files/Auto Loader vs. Delta). The Jobs UI typically shows the maximum backlog across your active streams.
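
If you want to reproduce that headline number for your own dashboard, here is a minimal PySpark sketch; estimate_stream_backlog is a hypothetical helper you would implement per source (see the per-source sketches further down), returning (backlog_records, backlog_duration_seconds) for one query:

# Headline metric: worst backlog across all active streams, mirroring the UI.
# estimate_stream_backlog(q) is a hypothetical per-source helper.
def headline_backlog(spark, estimate_stream_backlog):
    per_stream = [estimate_stream_backlog(q) for q in spark.streams.active]
    if not per_stream:
        return 0, 0.0
    max_records = max(records for records, _ in per_stream)
    max_duration = max(duration for _, duration in per_stream)
    return max_records, max_duration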

What “records” and “duration” really reflect

Records

Kafka: effectively consumer-group lag (latest available offset minus the last committed/processed offset).

Files / Auto Loader / Delta: rows (or files) discovered minus rows committed to the sink (for Delta sources, think “unprocessed rows/versions” in the current scan).

Duration

Practical estimate: backlog_duration ≈ backlog_records / processedRowsPerSecond (using the stream’s current processedRowsPerSecond from lastProgress).

In Kafka you can also approximate via offset timestamps (age of the lag), but “time to empty” still depends on the actual processing rate right now.
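
A minimal PySpark snippet for that estimate (backlog_records comes from whichever per-source method you use, as covered below):

# Drain-time estimate for one stream, from its most recent progress report.
def backlog_duration_seconds(query, backlog_records):
    progress = query.lastProgress or {}  # dict with the latest micro-batch progress, or None before the first batch
    rate = progress.get("processedRowsPerSecond", 0.0) or 0.0
    return backlog_records / rate if rate > 0 else float("inf")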

Source-specific notes

Kafka: offsets and optional offset timestamps (for age). For ground truth, use the broker’s consumer-group lag (see the sketch after these notes).

Auto Loader / files: discovery queue + row counts; file modification or ingestion times can serve as a rough “age” proxy.

Delta source: rows not yet processed; consider commit versions if you need lineage.
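
For the Kafka case, here is a minimal lag sketch with kafka-python (an assumption on my side: the offsets you care about are committed on the broker under a known group id; Structured Streaming keeps its own offsets in the checkpoint, so adjust this to however your pipeline exposes them):

# Consumer-group lag from the broker, summed across partitions (kafka-python).
from kafka import KafkaAdminClient, KafkaConsumer

def kafka_backlog_records(bootstrap_servers, group_id):
    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    committed = admin.list_consumer_group_offsets(group_id)  # {TopicPartition: OffsetAndMetadata}
    consumer = KafkaConsumer(bootstrap_servers=bootstrap_servers)
    latest = consumer.end_offsets(list(committed))           # {TopicPartition: latest offset}
    lag = sum(max(0, latest[tp] - meta.offset) for tp, meta in committed.items())
    consumer.close()
    admin.close()
    return lag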

Programmatic metrics (simple, cross-source sketch; plain text)

For files/Delta pipelines, compare discoverable row counts vs. committed row counts, then divide by processedRowsPerSecond for duration.

For Kafka, prefer broker lag from the Admin API, then divide by processedRowsPerSecond for duration.

Pseudocode outline:

find stream q by matching its output path in spark.streams.active
read q.lastProgress (JSON)
proc_rps = lastProgress.processedRowsPerSecond
if source_kind = files:
    total_in = count rows in input path (static read)
    total_out = count rows in delta sink
    backlog_records = max(0, total_in - total_out)
elif source_kind = delta:
    total_in = count rows in source delta table
    total_out = count rows in sink delta table
    backlog_records = max(0, total_in - total_out)
elif source_kind = kafka:
    backlog_records = consumer-group lag from broker (Admin API)
backlog_duration_seconds = backlog_records / proc_rps (if proc_rps > 0, else Infinity)
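
And here is a runnable version of the files/Delta branch (PySpark; source_path, sink_path and source_format are placeholders for your pipeline, and note that full counts can get expensive on large tables):

# Runnable sketch of the files/Delta branch above.
def files_or_delta_backlog(spark, query, source_path, sink_path, source_format="delta"):
    # Rows currently discoverable at the source (static batch read, not the stream itself)
    total_in = spark.read.format(source_format).load(source_path).count()
    # Rows already committed to the Delta sink
    total_out = spark.read.format("delta").load(sink_path).count()
    backlog_records = max(0, total_in - total_out)

    progress = query.lastProgress or {}
    rate = progress.get("processedRowsPerSecond", 0.0) or 0.0
    backlog_duration_s = backlog_records / rate if rate > 0 else float("inf")
    return backlog_records, backlog_duration_s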

Practical tips for your dashboard

Kafka: display broker lag for “records” and compute “duration” from processedRowsPerSecond. This is the most accurate option and reflects real throughput.

Files/Delta: prefer row counts over file counts; they’re more stable. If you need an “age” metric, use ingestion or commit times as a proxy and label it clearly.

Show a per-stream breakdown plus a headline metric with the maximum backlog and maximum duration (mirrors the UI’s behavior).

Smooth the duration with a moving average to avoid noisy alerts (rates fluctuate batch-by-batch).
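
For the smoothing, even a simple exponential moving average is enough; a tiny sketch (alpha is a tuning choice, smaller = smoother):

# Exponential moving average over the duration estimate, so alerts don't fire
# on a single noisy micro-batch.
class BacklogSmoother:
    def __init__(self, alpha=0.2):
        self.alpha = alpha
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = sample
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

Keep one smoother per stream and alert only when the smoothed duration stays above your threshold for a few consecutive batches.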

Hope that helps!

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
