
Impact of Capturing Streaming Metrics to ADLS on Data Load Performance

saicharandeepb
New Contributor II

Hi Community,

I’m working on capturing Structured Streaming metrics and persisting them to Azure Data Lake Storage (ADLS) for monitoring and logging. To achieve this, I implemented a custom StreamingQueryListener that writes streaming progress data as JSON files using the code snippet below.

[Screenshot: image (1).png]

To avoid generating multiple small files, I used coalesce(1) to reduce the DataFrame to a single partition so that Spark writes only one output file per batch. While this approach functions as intended, I’ve noticed that writing these metrics—particularly with coalesce(1)—is negatively impacting the overall data load performance.

Has anyone experienced similar performance issues when writing streaming metrics directly to external storage like ADLS?

What are some recommended asynchronous or buffered strategies for capturing and storing streaming metrics without affecting the main data processing workflow?

1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @saicharandeepb ,

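The behaviour you're experiencing is expected with coalesce. When you use coalesce(1), you're sacrificing parallelism: the write runs as a single task on one executor, and because coalesce does not introduce a shuffle, the upstream computation feeding that write can be collapsed onto the same single partition as well.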

There's even a warning in Apache Spark OSS regarding this:

[Screenshot: szymon_dybczak_0-1757053069172.png]

You can also check the following posts/blogs:

apache spark - does coalesce(1) the dataframe before write have any impact on performance? - Stack O...

Analyzing a 30x Slowdown in My Spark Program Due to Coalesce | LinkedIn
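
On the asynchronous or buffered strategies asked about in the question, here is a minimal sketch of one possible approach, not a tested production setup. The idea is that a progress event is already a small JSON string on the driver, so it may not need a DataFrame write or coalesce(1) at all: the listener callback only enqueues the string, and a background thread writes batches as JSON Lines files. The class name BufferedMetricsWriter, its parameters, and the output directory are all hypothetical, and the sketch assumes the target directory is reachable as a normal file path from the driver (for example through a mount or volume backed by ADLS).

```python
import os
import queue
import threading
import time
import uuid


class BufferedMetricsWriter:
    """Hypothetical helper: buffers progress JSON strings and writes them
    in batches from a background thread, one JSON Lines file per batch."""

    _STOP = object()  # sentinel telling the worker to flush and exit

    def __init__(self, out_dir, max_batch=100, flush_secs=30.0):
        self.out_dir = out_dir          # assumed writable from the driver
        self.max_batch = max_batch      # flush after this many events ...
        self.flush_secs = flush_secs    # ... or after this many seconds
        self._queue = queue.Queue()
        os.makedirs(out_dir, exist_ok=True)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, progress_json):
        """Meant to be called from the listener callback; it only enqueues."""
        self._queue.put(progress_json)

    def close(self):
        """Flush whatever is still buffered and stop the worker thread."""
        self._queue.put(self._STOP)
        self._worker.join()

    def _run(self):
        batch = []
        deadline = time.monotonic() + self.flush_secs
        while True:
            timeout = max(0.0, deadline - time.monotonic())
            try:
                item = self._queue.get(timeout=timeout)
            except queue.Empty:
                item = None  # flush interval elapsed with no new event
            if item is self._STOP:
                self._flush(batch)
                return
            if item is not None:
                batch.append(item)
            if len(batch) >= self.max_batch or time.monotonic() >= deadline:
                self._flush(batch)
                batch = []
                deadline = time.monotonic() + self.flush_secs

    def _flush(self, batch):
        if not batch:
            return
        # Timestamp plus random suffix keeps file names unique per batch.
        name = f"progress-{int(time.time() * 1000)}-{uuid.uuid4().hex}.jsonl"
        with open(os.path.join(self.out_dir, name), "w", encoding="utf-8") as f:
            f.write("\n".join(batch) + "\n")
```

To wire this in, a PySpark StreamingQueryListener could call writer.submit(event.progress.json) inside onQueryProgress and writer.close() inside onQueryTerminated. Larger max_batch and flush_secs values give fewer, bigger files at the cost of metrics arriving later, and events still in the buffer are lost if the driver dies before a flush, so whether this trade-off is acceptable depends on how the metrics are used.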