- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-15-2025 05:40 AM - edited 08-15-2025 05:42 AM
Hi everyone,
I recently worked on a similar requirement and would like to share a structured approach to monitoring Structured Streaming when writing to external sinks.
1. Use a Unique Query Name
Always assign a clear and meaningful name to each streaming query with .queryName("<your_query_name>"). This helps you easily identify the stream and its metrics in the Spark UI under the Streaming tab.
2. Leverage StreamingQueryListener for Metrics Reporting
Spark provides the StreamingQueryListener interface (available in both Python and Scala since Databricks Runtime 11.3 LTS) that allows you to capture lifecycle events like onQueryStarted, onQueryProgress, onQueryIdle, and onQueryTerminated.
Keep logic in these callbacks lightweight to avoid delaying processing—prefer writing metrics to a lightweight store such as Kafka or Prometheus.
3. Define and Observe Custom Metrics
Use the Observable API with .observe(...) to define custom metrics directly within your query—such as row counts, error counts, or data quality checks. These get emitted as events and can be picked up by your listener.
In your listener’s onQueryProgress, access these via event.progress.observedMetrics and handle alerts, dashboards, or logs as needed.
4. Capture Detailed Source and Sink Metrics
The event.progress object contains rich metrics related to source, state, and sink—such as input/output row counts, processing rates, offsets, backlog (e.g., offsets behind latest), and event time statistics.
This is particularly valuable when writing to external sinks like Kafka: you can monitor how many rows are actually delivered, detect lag, and track throughput.
5. Send Metrics to External Observability Tools
Use your listener to push structured metrics—for example, to Prometheus Pushgateway or similar systems. One common pattern is to serialize the progress as JSON and extract key metrics for real-time observability.
This enables dashboarding (e.g., via Grafana) and alerting on key events like increasing output latency or growing backlog.
6. Monitor External Sink Health
Ensure your listener also includes logic for sink connectivity and monitor whether records are being written successfully.
Track both successes and failures (e.g., network errors, backpressure). Combine these with built-in sink.numOutputRows metrics to get better visibility.
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa