Databricks Community

saicharandeepb · Thursday

Hi Community,

I’m working on capturing Structured Streaming metrics and persisting them to Azure Data Lake Storage (ADLS) for monitoring and logging. To achieve this, I implemented a custom StreamingQueryListener that writes streaming progress data as JSON files using the code snippet below.

To avoid generating multiple small files, I used coalesce(1) to reduce the DataFrame to a single partition so that Spark writes only one output file per batch. While this approach functions as intended, I’ve noticed that writing these metrics—particularly with coalesce(1)—is negatively impacting the overall data load performance.

Has anyone experienced similar performance issues when writing streaming metrics directly to external storage like ADLS?

What are some recommended asynchronous or buffered strategies for capturing and storing streaming metrics without affecting the main data processing workflow?

szymon_dybczak · Thursday

Hi @saicharandeepb ,

The behaviour you're experiencing is a known effect of coalesce. When you use coalesce(1), Spark avoids a shuffle by collapsing the upstream partitions into one, so you sacrifice parallelism: the work feeding that write runs as a single task on a single executor.

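There's even a warning about this in the Apache Spark OSS documentation for coalesce: a drastic coalesce, e.g. to numPartitions = 1, may result in your computation taking place on fewer nodes than you'd like (one node in the case of numPartitions = 1). To avoid this, you can call repartition(1) instead, which adds a shuffle step but lets the upstream partitions still run in parallel.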

You can also check the following posts/blogs:

apache spark - does coalesce(1) the dataframe before write have any impact on performance? - Stack O...

Analyzing a 30x Slowdown in My Spark Program Due to Coalesce | LinkedIn
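On the question of asynchronous or buffered strategies: one option is to skip the DataFrame write (and coalesce) for the metrics altogether. Each progress event is already a small JSON string on the driver, so it can be queued and flushed in batches by a background thread. Below is a minimal sketch of that idea in plain Python, not a tested Databricks recipe. The class name `BufferedMetricsWriter`, its parameters, and the output file naming are all hypothetical, and it assumes the target directory is reachable as an ordinary file path (for example through a mount), which may not match your setup.

```python
import queue
import threading
from pathlib import Path


class BufferedMetricsWriter:
    """Hypothetical helper: buffers JSON strings and writes them in batches
    from a background thread, so the caller never waits on storage I/O."""

    _STOP = object()  # sentinel that tells the worker thread to finish

    def __init__(self, out_dir, batch_size=50, flush_interval_s=30.0):
        self._out_dir = Path(out_dir)
        self._out_dir.mkdir(parents=True, exist_ok=True)
        self._batch_size = batch_size
        self._flush_interval_s = flush_interval_s
        self._queue = queue.Queue()
        self._file_index = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, progress_json):
        """Enqueue one record; returns immediately without touching storage."""
        self._queue.put(progress_json)

    def close(self):
        """Flush anything still buffered and stop the worker thread."""
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self):
        buffer = []
        while True:
            try:
                item = self._queue.get(timeout=self._flush_interval_s)
            except queue.Empty:
                item = None  # nothing arrived within the interval
            if item is self._STOP:
                self._flush(buffer)
                return
            if item is not None:
                buffer.append(item)
            # Flush when the batch is full, or when the queue has been idle
            # for flush_interval_s and there is something to write.
            if buffer and (item is None or len(buffer) >= self._batch_size):
                self._flush(buffer)
                buffer = []

    def _flush(self, buffer):
        if not buffer:
            return
        # One JSON-lines file per flush instead of one file per micro-batch.
        path = self._out_dir / f"progress-{self._file_index:06d}.jsonl"
        path.write_text("\n".join(buffer) + "\n", encoding="utf-8")
        self._file_index += 1
```

To wire this in, the listener's onQueryProgress would only call something like `writer.submit(event.progress.json)` (assuming a PySpark version where StreamingQueryListener is available in Python), and `writer.close()` would be called when the query terminates. The trade-off is that records still in the buffer are lost if the driver dies before a flush, so the batch size and flush interval should be chosen with that in mind.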


Impact of Capturing Streaming Metrics to ADLS on Data Load Performance
