Over the past few weeks, I’ve been exploring ways to capture streaming metrics from our data load jobs. The goal is to monitor job performance and behavior in real time, without disrupting our existing data load pipelines.
Initial Exploration: StreamingQueryListener vs Cluster Logs
My initial approach involved StreamingQueryListener, which provides hooks into Spark’s streaming lifecycle. While it only requires a minor code change, wiring it into every production job carries a risk of instability that we wanted to avoid.
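For context, the listener route would have looked roughly like the sketch below. It assumes PySpark 3.4 or later (where the Python listener API is available) and an existing `spark` session; the print calls stand in for whatever metrics sink we would actually use, and the `triggerExecution` key assumes the standard micro-batch progress payload:

```python
from pyspark.sql.streaming import StreamingQueryListener


class MetricsListener(StreamingQueryListener):
    """Logs per-batch streaming metrics (stdout here as a stand-in sink)."""

    def onQueryStarted(self, event):
        print(f"Query started: {event.name} ({event.id})")

    def onQueryProgress(self, event):
        p = event.progress
        print({
            "query": p.name,
            "batchId": p.batchId,
            "inputRowsPerSecond": p.inputRowsPerSecond,
            "processedRowsPerSecond": p.processedRowsPerSecond,
            "batchDurationMs": p.durationMs.get("triggerExecution"),
        })

    def onQueryIdle(self, event):  # Spark 3.5+
        pass

    def onQueryTerminated(self, event):
        print(f"Query terminated: {event.id}")


# Registering the listener is the "minor code change" each job would need.
spark.streams.addListener(MetricsListener())
```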
To keep things non-intrusive, I pivoted to cluster logs, which are already being generated and can be redirected to a volume for external processing. This method aligns well with our operational constraints and avoids touching the running jobs.
Observations After Enabling Cluster Logs
Once the logs were redirected to a volume, I noticed an interesting behavior in how Spark handles event logging:
Log Compression Behavior
- When a streaming job starts, an eventlog file is created, and logs begin accumulating in it.
- Every 10 minutes, the accumulated logs in the eventlog file are compressed into a new .gz file.
- After compression, the eventlog file continues collecting logs for the next interval.
- This cycle repeats throughout the job duration.
Example Scenario
Suppose a streaming job runs for 3 hours and 8 minutes. Here's what happens:
- You’ll get 18 .gz files for the first 3 hours (3 hours × 6 files/hour).
- The remaining 8 minutes of logs will still reside in the eventlog file, as they haven’t yet reached the next 10-minute compression threshold.

This rolling compression mechanism is efficient for log management but introduces complexity when trying to capture streaming metrics in real time, especially since the logs are split across multiple compressed files and the active eventlog file.
The Challenge: Real-Time Parsing of Compressed Logs
To extract metrics like input rate, processing rate, and batch duration, we need to do the following (a rough sketch follows the list):
- Continuously monitor the volume for new .gz files
- Decompress and parse each file as it arrives
- Extract relevant metrics from the event log JSON structure
- Stream the metrics to a monitoring or visualization system
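Here is a rough sketch of the first three steps, using only the Python standard library. The volume path, the polling interval, and the event name for streaming progress entries (`StreamingQueryListener$QueryProgressEvent`) are assumptions on my part, and the exact payload can differ across Spark and Databricks runtime versions:

```python
import glob
import gzip
import json
import time

# Assumed volume path where the cluster's eventlog directory is delivered.
LOG_DIR = "/Volumes/monitoring/default/cluster_logs/eventlog"

# Event name under which streaming progress appears in the event log;
# this can vary by Spark / Databricks runtime version.
PROGRESS_EVENT = (
    "org.apache.spark.sql.streaming.StreamingQueryListener$QueryProgressEvent"
)


def extract_metrics(gz_path):
    """Decompress one rolled event log file and yield per-batch metrics."""
    with gzip.open(gz_path, "rt") as f:
        for line in f:  # event logs are JSON lines, one event per line
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("Event") != PROGRESS_EVENT:
                continue
            progress = event.get("progress", {})
            yield {
                "batchId": progress.get("batchId"),
                "timestamp": progress.get("timestamp"),
                "inputRowsPerSecond": progress.get("inputRowsPerSecond"),
                "processedRowsPerSecond": progress.get("processedRowsPerSecond"),
                "batchDurationMs": progress.get("durationMs", {}).get("triggerExecution"),
            }


def watch(poll_seconds=60):
    """Poll the volume for new .gz files and parse each one exactly once.

    Note: metrics still sitting in the active eventlog file are not covered here.
    """
    seen = set()
    while True:
        for path in sorted(glob.glob(f"{LOG_DIR}/*.gz")):
            if path in seen:
                continue
            seen.add(path)
            for metric in extract_metrics(path):
                print(metric)  # stand-in for the downstream metrics sink
        time.sleep(poll_seconds)
```

A hand-rolled polling loop like this works, but it is exactly the kind of sequential processing I would prefer to replace with something stream-native.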
However, Structured Streaming doesn’t natively support reading .gz files, which complicates real-time processing.
Scalability Concerns
One workaround I considered was using Azure Event Hub to stream the parsed metrics from each .gz file. But with over 100 data load pipelines, processing each file in a loop would likely fall short of our goal of near real-time monitoring. The latency introduced by sequential processing could make the metrics less actionable.
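For the Event Hub leg specifically, the send side is not the hard part. A minimal sketch using the azure-eventhub Python SDK might look like the following, where the connection string and hub name are placeholders and `extract_metrics` is the hypothetical parsing helper from the sketch above; the real question is how to fan this out across 100+ pipelines without the sequential latency described above:

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders; real values would come from a secret scope or job config.
EVENTHUB_CONN_STR = "<event-hub-namespace-connection-string>"
EVENTHUB_NAME = "streaming-metrics"


def publish_metrics(gz_path):
    """Parse one rolled .gz file and push its metrics to Event Hub as one batch."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=EVENTHUB_CONN_STR, eventhub_name=EVENTHUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        for metric in extract_metrics(gz_path):  # helper from the parsing sketch
            # In practice this needs to handle the batch-size limit (ValueError)
            # by sending the current batch and starting a new one.
            batch.add(EventData(json.dumps(metric)))
        producer.send_batch(batch)
```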
Open Call for Suggestions
I’m reaching out to the community to ask:
- Has anyone implemented a similar approach using cluster logs?
- How did you handle real-time parsing of compressed log files?
- Are there scalable solutions or best practices you’d recommend?
If you’ve tackled this challenge or have ideas, I’d love to hear from you. One idea I’m exploring is using Event Hub to stream parsed metrics, but I’m open to better or more efficient alternatives.