Over the past few weeks, I’ve been exploring ways to capture streaming metrics from our data load jobs. The goal is to monitor job performance and behavior in real time, without disrupting our existing data load pipelines.
Initial Exploration: StreamingQueryListener vs Cluster Logs
My initial approach involved StreamingQueryListener, which provides hooks into Spark’s streaming lifecycle. While it only requires a minor code change, wiring it into every production job carries a risk of instability that we wanted to avoid.
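For context, the listener route would have looked roughly like the sketch below. It assumes PySpark 3.4 or later (where the Python listener API is available) and an existing `spark` session; the print calls stand in for whatever metrics sink we would actually use, and the `triggerExecution` key assumes the standard micro-batch progress payload:

```python
from pyspark.sql.streaming import StreamingQueryListener


class MetricsListener(StreamingQueryListener):
    """Logs per-batch streaming metrics (stdout here as a stand-in sink)."""

    def onQueryStarted(self, event):
        print(f"Query started: {event.name} ({event.id})")

    def onQueryProgress(self, event):
        p = event.progress
        print({
            "query": p.name,
            "batchId": p.batchId,
            "inputRowsPerSecond": p.inputRowsPerSecond,
            "processedRowsPerSecond": p.processedRowsPerSecond,
            "batchDurationMs": p.durationMs.get("triggerExecution"),
        })

    def onQueryIdle(self, event):  # Spark 3.5+
        pass

    def onQueryTerminated(self, event):
        print(f"Query terminated: {event.id}")


# Registering the listener is the "minor code change" each job would need.
spark.streams.addListener(MetricsListener())
```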
To keep things non-intrusive, I pivoted to cluster logs, which are already being generated and can be redirected to a volume for external processing. This method aligns well with our operational constraints and avoids touching the running jobs.
Observations After Enabling Cluster Logs
Once the logs were redirected to a volume, I noticed an interesting behavior in how Spark handles event logging:
Log Compression Behavior
- When a streaming job starts, an eventlog file is created, and logs begin accumulating in it.
- Every 10 minutes, the accumulated logs in the eventlog file are compressed into a new .gz file.
- After compression, the eventlog file continues collecting logs for the next interval.
- This cycle repeats throughout the job duration.
Example Scenario
Suppose a streaming job runs for 3 hours and 8 minutes. Here's what happens:
- You’ll get 18 .gz files for the first 3 hours (3 hours × 6 files/hour).
- The remaining 8 minutes of logs will still reside in the eventlog file, as they haven’t yet reached the next 10-minute compression threshold.

This rolling compression mechanism is efficient for log management but introduces complexity when trying to capture streaming metrics in real time, especially since the logs are split across multiple compressed files and the active eventlog file.
The Challenge: Real-Time Parsing of Compressed Logs
To extract metrics like input rate, processing rate, and batch duration, we need to do the following (a rough sketch follows the list):
- Continuously monitor the volume for new .gz files
- Decompress and parse each file as it arrives
- Extract relevant metrics from the event log JSON structure
- Stream the metrics to a monitoring or visualization system
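Here is a rough sketch of the first three steps, using only the Python standard library. The volume path, the polling interval, and the event name for streaming progress entries (`StreamingQueryListener$QueryProgressEvent`) are assumptions on my part, and the exact payload can differ across Spark and Databricks runtime versions:

```python
import glob
import gzip
import json
import time

# Assumed volume path where the cluster's eventlog directory is delivered.
LOG_DIR = "/Volumes/monitoring/default/cluster_logs/eventlog"

# Event name under which streaming progress appears in the event log;
# this can vary by Spark / Databricks runtime version.
PROGRESS_EVENT = (
    "org.apache.spark.sql.streaming.StreamingQueryListener$QueryProgressEvent"
)


def extract_metrics(gz_path):
    """Decompress one rolled event log file and yield per-batch metrics."""
    with gzip.open(gz_path, "rt") as f:
        for line in f:  # event logs are JSON lines, one event per line
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("Event") != PROGRESS_EVENT:
                continue
            progress = event.get("progress", {})
            yield {
                "batchId": progress.get("batchId"),
                "timestamp": progress.get("timestamp"),
                "inputRowsPerSecond": progress.get("inputRowsPerSecond"),
                "processedRowsPerSecond": progress.get("processedRowsPerSecond"),
                "batchDurationMs": progress.get("durationMs", {}).get("triggerExecution"),
            }


def watch(poll_seconds=60):
    """Poll the volume for new .gz files and parse each one exactly once.

    Note: metrics still sitting in the active eventlog file are not covered here.
    """
    seen = set()
    while True:
        for path in sorted(glob.glob(f"{LOG_DIR}/*.gz")):
            if path in seen:
                continue
            seen.add(path)
            for metric in extract_metrics(path):
                print(metric)  # stand-in for the downstream metrics sink
        time.sleep(poll_seconds)
```

A hand-rolled polling loop like this works, but it is exactly the kind of sequential processing I would prefer to replace with something stream-native.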
However, Structured Streaming doesn’t natively support reading .gz files, which complicates real-time processing.
Scalability Concerns
One workaround I considered was using Azure Event Hub to stream the parsed metrics from each .gz file. But with over 100 data load pipelines, processing each file in a loop would likely fall short of our goal of near real-time monitoring. The latency introduced by sequential processing could make the metrics less actionable.
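For the Event Hub leg specifically, the send side is not the hard part. A minimal sketch using the azure-eventhub Python SDK might look like the following, where the connection string and hub name are placeholders and `extract_metrics` is the hypothetical parsing helper from the sketch above; the real question is how to fan this out across 100+ pipelines without the sequential latency described above:

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders; real values would come from a secret scope or job config.
EVENTHUB_CONN_STR = "<event-hub-namespace-connection-string>"
EVENTHUB_NAME = "streaming-metrics"


def publish_metrics(gz_path):
    """Parse one rolled .gz file and push its metrics to Event Hub as one batch."""
    producer = EventHubProducerClient.from_connection_string(
        conn_str=EVENTHUB_CONN_STR, eventhub_name=EVENTHUB_NAME
    )
    with producer:
        batch = producer.create_batch()
        for metric in extract_metrics(gz_path):  # helper from the parsing sketch
            # In practice this needs to handle the batch-size limit (ValueError)
            # by sending the current batch and starting a new one.
            batch.add(EventData(json.dumps(metric)))
        producer.send_batch(batch)
```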
Open Call for Suggestions
I’m reaching out to the community to ask:
- Has anyone implemented a similar approach using cluster logs?
- How did you handle real-time parsing of compressed log files?
- Are there scalable solutions or best practices you’d recommend?
If you’ve tackled this challenge or have ideas, I’d love to hear from you. One idea I’m exploring is using Event Hub to stream parsed metrics, but I’m open to better or more efficient alternatives.