<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Clarification on how Streaming Backlog Duration &amp;amp; Records are calculated in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/clarification-on-how-streaming-backlog-duration-amp-records-are/m-p/129364#M48508</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170061"&gt;@saicharandeepb&lt;/a&gt;,&lt;BR /&gt;Here’s what I found after reproducing this in Databricks (Auto Loader and the rate source) and inspecting lastProgress.&lt;/P&gt;&lt;P&gt;TL;DR&lt;/P&gt;&lt;P&gt;Streaming Backlog (records) = how many input records are still discoverable but not yet committed by the consumer.&lt;/P&gt;&lt;P&gt;Streaming Backlog (duration) = an estimate of time to drain that backlog; in practice it depends on the current processing rate.&lt;/P&gt;&lt;P&gt;The exact mechanics vary by source (Kafka vs. files/Auto Loader vs. Delta). The Jobs UI typically shows the maximum backlog across your active streams.&lt;/P&gt;&lt;P&gt;What “records” and “duration” really reflect&lt;/P&gt;&lt;P&gt;Records&lt;/P&gt;&lt;P&gt;Kafka: effectively consumer group lag (latest available offset minus committed/end offset).&lt;/P&gt;&lt;P&gt;Files / Auto Loader / Delta: rows (or files) discovered minus rows committed to the sink (for Delta sources, think “unprocessed rows/versions” in the current scan).&lt;/P&gt;&lt;P&gt;Duration&lt;/P&gt;&lt;P&gt;Practical estimate: backlog_duration ≈ backlog_records / processedRowsPerSecond (using the stream’s current processedRowsPerSecond from lastProgress).&lt;/P&gt;&lt;P&gt;In Kafka you can also approximate via offset timestamps (age of the lag), but “time to empty” still depends on the actual processing rate right now.&lt;/P&gt;&lt;P&gt;Source-specific notes&lt;/P&gt;&lt;P&gt;Kafka: offsets and optional offset timestamps (for age). For ground truth, use the broker’s consumer-group lag.&lt;/P&gt;&lt;P&gt;Auto Loader / files: discovery queue + row counts; file modification or ingestion times can serve as a rough “age” proxy.&lt;/P&gt;&lt;P&gt;Delta source: rows not yet processed; consider commit versions if you need lineage.&lt;/P&gt;&lt;P&gt;Programmatic metrics (simple, cross-source sketch; plain text)&lt;/P&gt;&lt;P&gt;For files/Delta pipelines, compare discoverable row counts vs. committed row counts, then divide by processedRowsPerSecond for duration.&lt;/P&gt;&lt;P&gt;For Kafka, prefer broker lag from the Admin API, then divide by processedRowsPerSecond for duration.&lt;/P&gt;&lt;P&gt;Pseudocode outline:&lt;/P&gt;&lt;P&gt;find stream by matching its output path in spark.streams.active&lt;BR /&gt;read q.lastProgress (JSON)&lt;BR /&gt;proc_rps = lastProgress.processedRowsPerSecond&lt;BR /&gt;if source_kind = files:&lt;BR /&gt;total_in = count rows in input path (static read)&lt;BR /&gt;total_out = count rows in delta sink&lt;BR /&gt;backlog_records = max(0, total_in - total_out)&lt;BR /&gt;elif source_kind = delta:&lt;BR /&gt;total_in = count rows in source delta table&lt;BR /&gt;total_out = count rows in sink delta table&lt;BR /&gt;backlog_records = max(0, total_in - total_out)&lt;BR /&gt;elif source_kind = kafka:&lt;BR /&gt;backlog_records = consumer-group lag from broker (Admin API)&lt;BR /&gt;backlog_duration_seconds = backlog_records / proc_rps (if proc_rps &amp;gt; 0, else Infinity)&lt;/P&gt;&lt;P&gt;Practical tips for your dashboard&lt;/P&gt;&lt;P&gt;Kafka: display broker lag for “records” and compute “duration” from processedRowsPerSecond. This is most accurate and reflects real throughput.&lt;/P&gt;&lt;P&gt;Files/Delta: prefer row counts over file counts; they’re more stable. If you need an “age” metric, use ingestion or commit times as a proxy and label it clearly.&lt;/P&gt;&lt;P&gt;Show a per-stream breakdown plus a headline metric with the maximum backlog and maximum duration (mirrors the UI’s behavior).&lt;/P&gt;&lt;P&gt;Smooth the duration with a moving average to avoid noisy alerts (rates fluctuate batch-by-batch).&lt;BR /&gt;&lt;BR /&gt;Hope that helps!&lt;/P&gt;</description>
    <pubDate>Fri, 22 Aug 2025 20:48:59 GMT</pubDate>
    <dc:creator>WiliamRosa</dc:creator>
    <dc:date>2025-08-22T20:48:59Z</dc:date>
    <item>
      <title>Clarification on how Streaming Backlog Duration &amp; Records are calculated</title>
      <link>https://community.databricks.com/t5/data-engineering/clarification-on-how-streaming-backlog-duration-amp-records-are/m-p/129316#M48493</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I’m working on preparing a &lt;STRONG&gt;dashboard for streaming observability&lt;/STRONG&gt; and I’m trying to understand how some of the backlog metrics shown in Databricks are actually calculated.&lt;/P&gt;&lt;P&gt;In particular, I’m looking at:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Streaming Backlog (records):&lt;/STRONG&gt; described as the expected maximum offset lag across all streams.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Streaming Backlog (duration):&lt;/STRONG&gt; described as the expected maximum consumer delay across all streams.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;From the docs, I understand the definitions, but for my dashboard I’d like to know:&lt;/P&gt;&lt;P&gt;-How exactly are these values computed internally in Databricks?&lt;BR /&gt;-Are they based on offsets, event-time timestamps, processing rates, or a combination?&lt;BR /&gt;-Do they vary by source (e.g.,&amp;nbsp; Auto Loader vs. Delta)?&lt;/P&gt;&lt;P&gt;Any clarification or additional detail would really help me design accurate monitoring and alerting in my dashboard.&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 22 Aug 2025 14:58:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/clarification-on-how-streaming-backlog-duration-amp-records-are/m-p/129316#M48493</guid>
      <dc:creator>saicharandeepb</dc:creator>
      <dc:date>2025-08-22T14:58:16Z</dc:date>
    </item>
    <item>
      <title>Re: Clarification on how Streaming Backlog Duration &amp; Records are calculated</title>
      <link>https://community.databricks.com/t5/data-engineering/clarification-on-how-streaming-backlog-duration-amp-records-are/m-p/129364#M48508</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170061"&gt;@saicharandeepb&lt;/a&gt;,&lt;BR /&gt;Here’s what I found after reproducing this in Databricks (Auto Loader and the rate source) and inspecting lastProgress.&lt;/P&gt;&lt;P&gt;TL;DR&lt;/P&gt;&lt;P&gt;Streaming Backlog (records) = how many input records are still discoverable but not yet committed by the consumer.&lt;/P&gt;&lt;P&gt;Streaming Backlog (duration) = an estimate of time to drain that backlog; in practice it depends on the current processing rate.&lt;/P&gt;&lt;P&gt;The exact mechanics vary by source (Kafka vs. files/Auto Loader vs. Delta). The Jobs UI typically shows the maximum backlog across your active streams.&lt;/P&gt;&lt;P&gt;What “records” and “duration” really reflect&lt;/P&gt;&lt;P&gt;Records&lt;/P&gt;&lt;P&gt;Kafka: effectively consumer group lag (latest available offset minus committed/end offset).&lt;/P&gt;&lt;P&gt;Files / Auto Loader / Delta: rows (or files) discovered minus rows committed to the sink (for Delta sources, think “unprocessed rows/versions” in the current scan).&lt;/P&gt;&lt;P&gt;Duration&lt;/P&gt;&lt;P&gt;Practical estimate: backlog_duration ≈ backlog_records / processedRowsPerSecond (using the stream’s current processedRowsPerSecond from lastProgress).&lt;/P&gt;&lt;P&gt;In Kafka you can also approximate via offset timestamps (age of the lag), but “time to empty” still depends on the actual processing rate right now.&lt;/P&gt;&lt;P&gt;Source-specific notes&lt;/P&gt;&lt;P&gt;Kafka: offsets and optional offset timestamps (for age). For ground truth, use the broker’s consumer-group lag.&lt;/P&gt;&lt;P&gt;Auto Loader / files: discovery queue + row counts; file modification or ingestion times can serve as a rough “age” proxy.&lt;/P&gt;&lt;P&gt;Delta source: rows not yet processed; consider commit versions if you need lineage.&lt;/P&gt;&lt;P&gt;Programmatic metrics (simple, cross-source sketch; plain text)&lt;/P&gt;&lt;P&gt;For files/Delta pipelines, compare discoverable row counts vs. committed row counts, then divide by processedRowsPerSecond for duration.&lt;/P&gt;&lt;P&gt;For Kafka, prefer broker lag from the Admin API, then divide by processedRowsPerSecond for duration.&lt;/P&gt;&lt;P&gt;Pseudocode outline:&lt;/P&gt;&lt;P&gt;find stream by matching its output path in spark.streams.active&lt;BR /&gt;read q.lastProgress (JSON)&lt;BR /&gt;proc_rps = lastProgress.processedRowsPerSecond&lt;BR /&gt;if source_kind = files:&lt;BR /&gt;total_in = count rows in input path (static read)&lt;BR /&gt;total_out = count rows in delta sink&lt;BR /&gt;backlog_records = max(0, total_in - total_out)&lt;BR /&gt;elif source_kind = delta:&lt;BR /&gt;total_in = count rows in source delta table&lt;BR /&gt;total_out = count rows in sink delta table&lt;BR /&gt;backlog_records = max(0, total_in - total_out)&lt;BR /&gt;elif source_kind = kafka:&lt;BR /&gt;backlog_records = consumer-group lag from broker (Admin API)&lt;BR /&gt;backlog_duration_seconds = backlog_records / proc_rps (if proc_rps &amp;gt; 0, else Infinity)&lt;/P&gt;&lt;P&gt;Practical tips for your dashboard&lt;/P&gt;&lt;P&gt;Kafka: display broker lag for “records” and compute “duration” from processedRowsPerSecond. This is most accurate and reflects real throughput.&lt;/P&gt;&lt;P&gt;Files/Delta: prefer row counts over file counts; they’re more stable. If you need an “age” metric, use ingestion or commit times as a proxy and label it clearly.&lt;/P&gt;&lt;P&gt;Show a per-stream breakdown plus a headline metric with the maximum backlog and maximum duration (mirrors the UI’s behavior).&lt;/P&gt;&lt;P&gt;Smooth the duration with a moving average to avoid noisy alerts (rates fluctuate batch-by-batch).&lt;BR /&gt;&lt;BR /&gt;Hope that helps!&lt;/P&gt;</description>
      <pubDate>Fri, 22 Aug 2025 20:48:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/clarification-on-how-streaming-backlog-duration-amp-records-are/m-p/129364#M48508</guid>
      <dc:creator>WiliamRosa</dc:creator>
      <dc:date>2025-08-22T20:48:59Z</dc:date>
    </item>
  </channel>
</rss>

