Hi Community,
I’m working on capturing Structured Streaming metrics and persisting them to Azure Data Lake Storage (ADLS) for monitoring and logging. To do this, I implemented a custom StreamingQueryListener that writes each batch’s streaming progress as a JSON file, using the code snippet below.
To avoid generating many small files, I use coalesce(1) to reduce the DataFrame to a single partition so that Spark writes only one output file per batch. This works as intended, but writing these metrics synchronously, particularly with coalesce(1), which funnels the entire write through a single task, is noticeably degrading the performance of the main data load.
Has anyone experienced similar performance issues when writing streaming metrics directly to external storage like ADLS?
What are some recommended asynchronous or buffered strategies for capturing and storing streaming metrics without affecting the main data processing workflow?