Data Engineering

Capturing Streaming Metrics in Near Real-Time Using Cluster Logs

saicharandeepb
New Contributor III

Over the past few weeks, I’ve been exploring ways to capture streaming metrics from our data load jobs. The goal is to monitor job performance and behavior in real time, without disrupting our existing data load pipelines.

Initial Exploration: StreamingQueryListener vs Cluster Logs

My initial approach used StreamingQueryListener, which provides hooks into Spark's streaming query lifecycle. It requires only minor code changes, but wiring it into production jobs could introduce risk or instability, which we wanted to avoid.
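For context, the listener route I stepped back from would look roughly like this. This is a minimal sketch: `progress_to_row` and `MetricsListener` are hypothetical names, and the extracted fields mirror the StreamingQueryProgress JSON (`batchId`, `inputRowsPerSecond`, `durationMs.triggerExecution`). The Spark wiring needs a live session, so it is shown as comments:

```python
def progress_to_row(progress: dict) -> dict:
    """Flatten the fields we care about from a StreamingQueryProgress JSON dict."""
    return {
        "query_id": progress.get("id"),
        "batch_id": progress.get("batchId"),
        "input_rows_per_sec": progress.get("inputRowsPerSecond"),
        "processed_rows_per_sec": progress.get("processedRowsPerSecond"),
        "batch_duration_ms": (progress.get("durationMs") or {}).get("triggerExecution"),
    }

# On PySpark 3.4+ this would plug into a listener roughly like so:
#
#   import json
#   from pyspark.sql.streaming import StreamingQueryListener
#
#   class MetricsListener(StreamingQueryListener):
#       def onQueryStarted(self, event): pass
#       def onQueryIdle(self, event): pass
#       def onQueryTerminated(self, event): pass
#       def onQueryProgress(self, event):
#           row = progress_to_row(json.loads(event.progress.json))
#           ...  # ship `row` to your metrics sink
#
#   spark.streams.addListener(MetricsListener())
```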

To keep things non-intrusive, I pivoted to cluster logs, which are already being generated and can be redirected to a volume for external processing. This method aligns well with our operational constraints and avoids touching the running jobs.

Observations After Enabling Cluster Logs

Once the logs were redirected to a volume, I noticed an interesting behavior in how Spark handles event logging:

Log Compression Behavior

  • When a streaming job starts, an eventlog file is created, and logs begin accumulating in it.
  • Every 10 minutes, the accumulated logs in the eventlog file are compressed into a new .gz file.
  • After compression, the eventlog file continues collecting logs for the next interval.
  • This cycle repeats throughout the job duration.

Example Scenario

Suppose a streaming job runs for 3 hours and 8 minutes. Here's what happens:

  • You’ll get 18 .gz files for the first 3 hours (3 hours × 6 files/hour).
  • The remaining 8 minutes of logs will still reside in the eventlog file, as they haven’t yet reached the next 10-minute compression threshold.
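A quick sanity check on that arithmetic (assuming a fixed 10-minute roll interval):

```python
def expected_gz_files(runtime_minutes: int, roll_minutes: int = 10) -> int:
    """Number of completed compression cycles, i.e. rolled .gz files."""
    return runtime_minutes // roll_minutes

# 3 hours 8 minutes = 188 minutes -> 18 rolled .gz files covering the first
# 180 minutes, with the trailing 8 minutes still in the live eventlog file.
```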


This rolling compression mechanism is efficient for log management but introduces complexity when trying to capture streaming metrics in real time, especially since the logs are split across multiple compressed files and the active eventlog file.

The Challenge: Real-Time Parsing of Compressed Logs

To extract metrics like input rate, processing rate, and batch duration, we need to:

  • Continuously monitor the volume for new .gz files
  • Decompress and parse each file as it arrives
  • Extract relevant metrics from the event log JSON structure
  • Stream the metrics to a monitoring or visualization system

However, Structured Streaming doesn’t natively support reading .gz files, which complicates real-time processing.
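The parsing side of those steps can at least be sketched in plain Python, outside Structured Streaming. This is a sketch under assumptions: the event name and the `progress` JSON shape below are based on how Spark serializes QueryProgressEvent into the event log, so verify them against your own logs, and `poll_new_files` stands in for whatever scheduling loop drives it:

```python
import gzip
import json
from pathlib import Path

# Assumed event name for streaming progress entries in the Spark event log.
PROGRESS_EVENT = "org.apache.spark.sql.streaming.StreamingQueryListener$QueryProgressEvent"

def parse_gz_eventlog(path: Path) -> list[dict]:
    """Extract streaming metrics from one rolled .gz event log file."""
    metrics = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:  # one JSON event per line
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or non-JSON lines
            if event.get("Event") != PROGRESS_EVENT:
                continue
            p = event.get("progress", {})
            metrics.append({
                "batch_id": p.get("batchId"),
                "input_rows_per_sec": p.get("inputRowsPerSecond"),
                "processed_rows_per_sec": p.get("processedRowsPerSecond"),
                "batch_duration_ms": (p.get("durationMs") or {}).get("triggerExecution"),
            })
    return metrics

def poll_new_files(log_dir: Path, seen: set) -> list[Path]:
    """Return rolled .gz files that haven't been processed yet."""
    fresh = [p for p in sorted(log_dir.glob("*.gz")) if p.name not in seen]
    seen.update(p.name for p in fresh)
    return fresh
```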

Scalability Concerns

One workaround I considered was using Azure Event Hub to stream the parsed metrics from each .gz file. But with over 100 data load pipelines, processing each file in a loop would likely fall short of our goal of near real-time monitoring. The latency introduced by sequential processing could make the metrics less actionable.
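If the sequential loop is the bottleneck, one mitigation is fanning the per-file work out across a pool before publishing to Event Hub. A minimal sketch (`process_all` and `max_workers=16` are illustrative; `process_file` is whatever per-file parser you already have):

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(paths, process_file, max_workers=16):
    """Parse many rolled .gz files concurrently instead of in a sequential loop.

    Reading files from a cloud-backed volume is mostly I/O-bound, so a thread
    pool overlaps the waits; if JSON parsing dominates CPU time, swap in
    ProcessPoolExecutor with the same interface.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so downstream consumers see files in sequence
        return list(pool.map(process_file, paths))
```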

Open Call for Suggestions

I’m reaching out to the community to ask:

  • Has anyone implemented a similar approach using cluster logs?
  • How did you handle real-time parsing of compressed log files?
  • Are there scalable solutions or best practices you’d recommend?

If you’ve tackled this challenge or have ideas, I’d love to hear from you. One idea I’m exploring is using Event Hub to stream parsed metrics, but I’m open to better or more efficient alternatives.

 
