Hi Community,
I'm working on capturing Structured Streaming metrics and persisting them to Azure Data Lake Storage (ADLS) for monitoring and logging. To achieve this, I implemented a custom StreamingQueryListener that writes streaming progress data as JSON files, using the code snippet below.
To avoid generating lots of small files, I used coalesce(1) to reduce the DataFrame to a single partition so that Spark writes only one output file per batch. While this approach works as intended, I've noticed that writing these metrics, particularly with coalesce(1), is noticeably degrading the performance of the main data load.
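Here is a simplified sketch of what the listener does (the ADLS path is a placeholder for our actual container, and this assumes a classic cluster where sparkContext is available):

```python
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

# Placeholder target; the real path points at a container in our ADLS account
METRICS_PATH = "abfss://metrics@<storage-account>.dfs.core.windows.net/streaming_progress"

class ProgressToAdlsListener(StreamingQueryListener):
    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        # progress.json is the full per-batch progress report as a JSON string
        df = spark.read.json(spark.sparkContext.parallelize([event.progress.json]))
        # coalesce(1) forces a single output file per batch; this synchronous
        # write on every micro-batch is what seems to drag down the main load
        df.coalesce(1).write.mode("append").json(METRICS_PATH)

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(ProgressToAdlsListener())
```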
Has anyone experienced similar performance issues when writing streaming metrics directly to external storage like ADLS?
What are some recommended asynchronous or buffered strategies for capturing and storing streaming metrics without affecting the main data processing workflow?
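For reference, one direction I've been sketching is buffering the progress JSON in memory and flushing it in batches from a background thread, roughly like the snippet below (FLUSH_EVERY and the path are made-up placeholders, not a tested implementation):

```python
import threading
from pyspark.sql import SparkSession
from pyspark.sql.streaming import StreamingQueryListener

spark = SparkSession.builder.getOrCreate()

METRICS_PATH = "abfss://metrics@<storage-account>.dfs.core.windows.net/streaming_progress"
FLUSH_EVERY = 20  # hypothetical threshold: write once per 20 progress events

class BufferedProgressListener(StreamingQueryListener):
    def __init__(self):
        self._buffer = []
        self._lock = threading.Lock()

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        with self._lock:
            self._buffer.append(event.progress.json)
            if len(self._buffer) < FLUSH_EVERY:
                return
            batch, self._buffer = self._buffer, []
        # Flush on a daemon thread so the callback returns immediately
        # and the listener bus is not held up by the ADLS write
        threading.Thread(target=self._flush, args=(batch,), daemon=True).start()

    def _flush(self, batch):
        # One output file per FLUSH_EVERY events instead of one per micro-batch
        df = spark.read.json(spark.sparkContext.parallelize(batch))
        df.coalesce(1).write.mode("append").json(METRICS_PATH)

    def onQueryTerminated(self, event):
        pass
```

My concern with this approach is that buffered events would be lost if the driver dies before a flush, so I'd appreciate hearing how others have handled that trade-off.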