Hi @saicharandeepb,
Here’s what I found after reproducing this in Databricks (Auto Loader and the rate source) and inspecting lastProgress.
TL;DR
Streaming Backlog (records) = the number of input records the source has made available that your stream has not yet processed and committed.
Streaming Backlog (duration) = an estimate of time to drain that backlog; in practice it depends on the current processing rate.
The exact mechanics vary by source (Kafka vs. files/Auto Loader vs. Delta). The Jobs UI typically shows the maximum backlog across your active streams.
What “records” and “duration” really reflect
Records
Kafka: effectively consumer-group lag (the latest available offset minus the last committed/processed offset, summed across partitions).
Files / Auto Loader / Delta: rows (or files) discovered minus rows committed to the sink (for Delta sources, think “unprocessed rows/versions” in the current scan).
Duration
Practical estimate: backlog_duration ≈ backlog_records / processedRowsPerSecond (using the stream’s current processedRowsPerSecond from lastProgress).
In Kafka you can also approximate via offset timestamps (age of the lag), but “time to empty” still depends on the actual processing rate right now.
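A minimal sketch of that estimate in PySpark, assuming q is one of your active StreamingQuery objects and backlog_records has been computed separately (the same calculation appears in the fuller sketch further down):

p = q.lastProgress or {}                     # most recent progress report as a dict (None before the first batch completes)
rate = p.get("processedRowsPerSecond", 0.0)  # current processing rate for this stream
backlog_duration_seconds = backlog_records / rate if rate > 0 else float("inf")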
Source-specific notes
Kafka: offsets and optional offset timestamps (for age). For ground truth, use the broker’s consumer-group lag.
Auto Loader / files: discovery queue + row counts; file modification or ingestion times can serve as a rough “age” proxy (see the sketch after these notes).
Delta source: rows not yet processed; consider commit versions if you need lineage.
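Picking up the “age” proxy idea for file sources: a hedged sketch using Spark’s hidden _metadata column on a static read. input_path and the json format are placeholders, and a plain listing does not know which files the stream has already committed, so treat the result as a rough proxy, not a true backlog age:

from pyspark.sql import functions as F

# File-level modification times for everything currently under the input path
# (static read; input_path and the json format are placeholders)
file_times = (
    spark.read.format("json").load(input_path)
    .select("_metadata.file_path", "_metadata.file_modification_time")
    .distinct()
)
# Oldest file still present in the path; only a rough "age" proxy, because this
# listing includes files the stream may already have processed
file_times.agg(F.min("file_modification_time").alias("oldest_file_time")).show()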
Programmatic metrics (a simple, cross-source sketch in PySpark)
For files/Delta pipelines, compare discoverable row counts vs. committed row counts, then divide by processedRowsPerSecond for duration.
For Kafka, prefer broker lag from the Admin API, then divide by processedRowsPerSecond for duration.
Runnable sketch (assumes a Databricks notebook where spark is defined; source_kind, input_path, source_table, sink_table, sink_path, and the json input format are placeholders you supply):

# Find the stream whose sink description mentions our output path
q = next(
    s for s in spark.streams.active
    if sink_path in (s.lastProgress or {}).get("sink", {}).get("description", "")
)

progress = q.lastProgress or {}  # most recent progress report as a dict
proc_rps = progress.get("processedRowsPerSecond", 0.0)

if source_kind == "files":
    total_in = spark.read.format("json").load(input_path).count()   # static read of the input path
    total_out = spark.read.format("delta").load(sink_path).count()  # rows already committed to the sink
    backlog_records = max(0, total_in - total_out)
elif source_kind == "delta":
    total_in = spark.read.table(source_table).count()
    total_out = spark.read.table(sink_table).count()
    backlog_records = max(0, total_in - total_out)
elif source_kind == "kafka":
    backlog_records = get_kafka_consumer_group_lag()  # placeholder helper; see the Kafka sketch below

backlog_duration_seconds = backlog_records / proc_rps if proc_rps > 0 else float("inf")
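For the Kafka branch, here is one way to get broker-side consumer-group lag with the kafka-python library. The function name, bootstrap servers, and group id are assumptions for illustration, and this only helps if something in your pipeline actually commits offsets to that group (Structured Streaming tracks its position in the checkpoint and does not commit offsets to Kafka by default):

from kafka import KafkaAdminClient, KafkaConsumer

def get_kafka_consumer_group_lag(bootstrap="broker:9092", group_id="my-pipeline-group"):
    # Committed offsets for the group, straight from the broker
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    committed = admin.list_consumer_group_offsets(group_id)  # {TopicPartition: OffsetAndMetadata}

    # Latest available offsets for the same partitions
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    latest = consumer.end_offsets(list(committed.keys()))    # {TopicPartition: int}
    consumer.close()

    # Lag = latest available minus committed, summed across partitions
    return sum(max(0, latest[tp] - om.offset) for tp, om in committed.items())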
Practical tips for your dashboard
Kafka: display broker lag for “records” and compute “duration” from processedRowsPerSecond. This is most accurate and reflects real throughput.
Files/Delta: prefer row counts over file counts; they’re more stable. If you need an “age” metric, use ingestion or commit times as a proxy and label it clearly.
Show a per-stream breakdown plus a headline metric with the maximum backlog and maximum duration (mirrors the UI’s behavior).
Smooth the duration with a moving average to avoid noisy alerts (rates fluctuate batch-by-batch).
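For that smoothing, a tiny sketch (the 6-batch window and the function name are arbitrary choices):

from collections import deque

recent = deque(maxlen=6)  # keep the last few duration estimates

def smoothed_backlog_duration(latest_estimate_seconds):
    # Alert on the moving average instead of a single noisy batch-level reading
    recent.append(latest_estimate_seconds)
    return sum(recent) / len(recent)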
Hope that helps!
Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa