<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Capturing Streaming Metrics in Near Real-Time Using Cluster Logs in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134636#M50160</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170061"&gt;@saicharandeepb&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Good job on doing such detailed research on monitoring structured streaming.&amp;nbsp;If you need lower latency than rolling log permits, then have you tried this:&lt;BR /&gt;&lt;STRONG&gt;Cluster-wide listener injection&lt;/STRONG&gt;: Use &lt;CODE class="qt3gz9f"&gt;spark.extraListeners&lt;/CODE&gt; to register a custom listener JAR that emits &lt;CODE class="qt3gz9f"&gt;StreamingQueryProgress&lt;/CODE&gt; metrics directly to your sink (Event Hubs, Delta). This avoids per-job code changes but does alter cluster config and requires packaging and maintenance. This will&amp;nbsp;&lt;STRONG data-start="107" data-end="141"&gt;inject a cluster-wide listener&lt;/STRONG&gt; using &lt;CODE data-start="148" data-end="170"&gt;spark.extraListeners&lt;/CODE&gt; so &lt;STRONG data-start="174" data-end="187"&gt;every job&lt;/STRONG&gt; on the cluster automatically emits &lt;STRONG data-start="223" data-end="272"&gt;Structured Streaming &lt;CODE data-start="246" data-end="270"&gt;StreamingQueryProgress&lt;/CODE&gt;&lt;/STRONG&gt; metrics to your sink (Event Hubs, Delta, etc.).&lt;BR data-start="320" data-end="323" /&gt;Pros: no per-streaming code changes.&lt;BR data-start="350" data-end="353" /&gt;Cons: you’re changing cluster behavior for &lt;EM data-start="396" data-end="401"&gt;all&lt;/EM&gt; jobs and you now own a JAR + config.&lt;/P&gt;</description>
    <pubDate>Sat, 11 Oct 2025 17:36:45 GMT</pubDate>
    <dc:creator>Krishna_S</dc:creator>
    <dc:date>2025-10-11T17:36:45Z</dc:date>
    <item>
      <title>Capturing Streaming Metrics in Near Real-Time Using Cluster Logs</title>
      <link>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134491#M50144</link>
      <description>&lt;P&gt;Over the past few weeks, I’ve been exploring ways to capture &lt;STRONG&gt;streaming metrics&lt;/STRONG&gt; from our data load jobs. The goal is to monitor job performance and behavior in &lt;STRONG&gt;real time&lt;/STRONG&gt;, without disrupting our existing data load pipelines.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Initial Exploration: StreamingQueryListener vs Cluster Logs&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;My initial approach involved using &lt;STRONG&gt;StreamingQueryListener&lt;/STRONG&gt;, which provides hooks into Spark’s streaming lifecycle. While it only requires minor code changes, integrating it into production jobs could introduce risks or instability—something we wanted to avoid.&lt;/P&gt;&lt;P&gt;To keep things non-intrusive, I pivoted to &lt;STRONG&gt;cluster logs&lt;/STRONG&gt;, which are already being generated and can be redirected to a volume for external processing. This method aligns well with our operational constraints and avoids touching the running jobs.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Observations After Enabling Cluster Logs&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Once the logs were redirected to a volume, I noticed an interesting behavior in how Spark handles event logging:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Log Compression Behavior&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;When a streaming job starts, an &lt;STRONG&gt;eventlog file&lt;/STRONG&gt; is created, and logs begin accumulating in it.&lt;/LI&gt;&lt;LI&gt;Every &lt;STRONG&gt;10 minutes&lt;/STRONG&gt;, the accumulated logs in the eventlog file are &lt;STRONG&gt;compressed into a new .gz file&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;After compression, the eventlog file continues collecting logs for the next interval.&lt;/LI&gt;&lt;LI&gt;This cycle repeats throughout the job duration.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Example Scenario&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Suppose a streaming job runs for &lt;STRONG&gt;3 hours and 8 minutes&lt;/STRONG&gt;. Here's what happens:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;You’ll get &lt;STRONG&gt;18 .gz files&lt;/STRONG&gt; for the first 3 hours (3 hours × 6 files/hour).&lt;/LI&gt;&lt;LI&gt;The &lt;STRONG&gt;remaining 8 minutes of logs&lt;/STRONG&gt; will still reside in the &lt;STRONG&gt;eventlog file&lt;/STRONG&gt;, as they haven’t yet reached the next 10-minute compression threshold.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="saicharandeepb_0-1760081131866.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/20643i7553B2AB689F4930/image-size/medium?v=v2&amp;amp;px=400" role="button" title="saicharandeepb_0-1760081131866.png" alt="saicharandeepb_0-1760081131866.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This rolling compression mechanism is efficient for log management but introduces complexity when trying to &lt;STRONG&gt;capture streaming metrics in real time&lt;/STRONG&gt;, especially since the logs are split across multiple compressed files and the active eventlog file.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The Challenge: Real-Time Parsing of Compressed Logs&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;To extract metrics like input rate, processing rate, and batch duration, we need to:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Continuously monitor the volume for new .gz files&lt;/LI&gt;&lt;LI&gt;Decompress and parse each file as it arrives&lt;/LI&gt;&lt;LI&gt;Extract relevant metrics from the event log JSON structure&lt;/LI&gt;&lt;LI&gt;Stream the metrics to a monitoring or visualization system&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, &lt;STRONG&gt;Structured Streaming doesn’t natively support reading .gz files&lt;/STRONG&gt;, which complicates real-time processing.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Scalability Concerns&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;One workaround I considered was using &lt;STRONG&gt;Azure Event Hub&lt;/STRONG&gt; to stream the parsed metrics from each .gz file. But with &lt;STRONG&gt;over 100 data load pipelines&lt;/STRONG&gt;, processing each file in a loop would likely fall short of our goal of &lt;STRONG&gt;near real-time monitoring&lt;/STRONG&gt;. The latency introduced by sequential processing could make the metrics less actionable.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Open Call for Suggestions&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I’m reaching out to the community to ask:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Has anyone implemented a similar approach using cluster logs?&lt;/LI&gt;&lt;LI&gt;How did you handle real-time parsing of compressed log files?&lt;/LI&gt;&lt;LI&gt;Are there scalable solutions or best practices you’d recommend?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you’ve tackled this challenge or have ideas, I’d love to hear from you. One idea I’m exploring is using &lt;STRONG&gt;Event Hub&lt;/STRONG&gt; to stream parsed metrics, but I’m open to better or more efficient alternatives.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Oct 2025 07:26:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134491#M50144</guid>
      <dc:creator>saicharandeepb</dc:creator>
      <dc:date>2025-10-10T07:26:45Z</dc:date>
    </item>
    <item>
      <title>Re: Capturing Streaming Metrics in Near Real-Time Using Cluster Logs</title>
      <link>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134636#M50160</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170061"&gt;@saicharandeepb&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Good job on doing such detailed research on monitoring structured streaming.&amp;nbsp;If you need lower latency than rolling log permits, then have you tried this:&lt;BR /&gt;&lt;STRONG&gt;Cluster-wide listener injection&lt;/STRONG&gt;: Use &lt;CODE class="qt3gz9f"&gt;spark.extraListeners&lt;/CODE&gt; to register a custom listener JAR that emits &lt;CODE class="qt3gz9f"&gt;StreamingQueryProgress&lt;/CODE&gt; metrics directly to your sink (Event Hubs, Delta). This avoids per-job code changes but does alter cluster config and requires packaging and maintenance. This will&amp;nbsp;&lt;STRONG data-start="107" data-end="141"&gt;inject a cluster-wide listener&lt;/STRONG&gt; using &lt;CODE data-start="148" data-end="170"&gt;spark.extraListeners&lt;/CODE&gt; so &lt;STRONG data-start="174" data-end="187"&gt;every job&lt;/STRONG&gt; on the cluster automatically emits &lt;STRONG data-start="223" data-end="272"&gt;Structured Streaming &lt;CODE data-start="246" data-end="270"&gt;StreamingQueryProgress&lt;/CODE&gt;&lt;/STRONG&gt; metrics to your sink (Event Hubs, Delta, etc.).&lt;BR data-start="320" data-end="323" /&gt;Pros: no per-streaming code changes.&lt;BR data-start="350" data-end="353" /&gt;Cons: you’re changing cluster behavior for &lt;EM data-start="396" data-end="401"&gt;all&lt;/EM&gt; jobs and you now own a JAR + config.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Oct 2025 17:36:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134636#M50160</guid>
      <dc:creator>Krishna_S</dc:creator>
      <dc:date>2025-10-11T17:36:45Z</dc:date>
    </item>
    <item>
      <title>Re: Capturing Streaming Metrics in Near Real-Time Using Cluster Logs</title>
      <link>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134797#M50194</link>
      <description>&lt;P&gt;Thanks a lot for the suggestion and for highlighting the cluster-wide listener injection approach using spark.extraListeners. We did consider this option during our initial evaluation and were quite intrigued by its potential to eliminate per-job code changes.&lt;BR /&gt;However, we eventually stepped back from it due to the overhead of maintaining a custom JAR especially in a shared environment where changes can have broader implications. For now, we’ve inclined towards the rolling log-based approach as it doesn’t require job-level changes and avoids the overhead of maintaining custom JARs making it more manageable and scalable for our current setup&lt;BR /&gt;That said, we’re definitely open to learning more and exploring other strategies and would love to hear more from you or others in the community.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Oct 2025 05:55:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/capturing-streaming-metrics-in-near-real-time-using-cluster-logs/m-p/134797#M50194</guid>
      <dc:creator>saicharandeepb</dc:creator>
      <dc:date>2025-10-14T05:55:20Z</dc:date>
    </item>
  </channel>
</rss>

