07-22-2025 02:52 AM - edited 07-22-2025 02:53 AM
Hello,
I am running a Spark Structured Streaming job that reads data from AWS Kinesis and writes to external Delta tables stored in S3. I have noticed that over time the latency has been increasing, and that the addBatch and commitBatch durations grow with each batch. I am writing to the table in append mode.
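(For reference, these per-batch timings can be read from the streaming query's progress; a minimal sketch, assuming query is the handle returned when the stream is started:)

# query.lastProgress is a plain dict describing the most recent micro-batch.
progress = query.lastProgress
if progress is not None:
    # durationMs holds the per-phase timings, including addBatch.
    print(progress["batchId"], progress["durationMs"])

# recentProgress shows the trend across the last few batches.
for p in query.recentProgress:
    print(p["batchId"], p["durationMs"].get("addBatch"))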
I did run an OPTIMIZE and my latency improved, along with a reduction in the addBatch and commitBatch durations.
I know that too many small files reduce read performance, but my question is: do too many small streaming Parquet files in a Delta table also impact writes for the streaming job? But
07-22-2025 04:29 AM
But? Please share your whole query.
07-22-2025 04:55 AM - edited 07-22-2025 05:15 AM
@Khaja_Zaffer Sorry, the "But" is a typo. The query is simple: I read using readStream, extract the data, expand the JSON into a table structure, and write the data back using writeStream with append mode, checkpointing, and mergeSchema set to true. It is something like:
(data
.writeStream
.outputMode("append")
.format("delta")
.queryName("query_name")
.option("checkpointLocation", checkpoint_location)
.option("header", "true")
.option("mergeSchema", "true")
.toTable(table_name)
)
I cannot share the complete query, but it is quite straightforward. And it becomes slower and slower as the table size grows.
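For completeness, the read side is roughly this; a minimal sketch rather than the real job, and the stream name, region, and payload schema below are just placeholders:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder schema for the JSON payload -- the real one is wider.
payload_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("body", StringType()),
])

raw = (spark.readStream
    .format("kinesis")                  # Databricks Kinesis connector
    .option("streamName", "my-stream")  # placeholder stream name
    .option("region", "us-east-1")      # placeholder region
    .load())

# Kinesis delivers the record payload as binary in the "data" column;
# cast it to string and expand the JSON into columns.
data = (raw
    .select(from_json(col("data").cast("string"), payload_schema).alias("payload"))
    .select("payload.*"))

The writeStream shown above then consumes this data DataFrame.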
07-22-2025 11:36 AM
Understood. @VaderKB
Please check this documentation: https://docs.databricks.com/aws/en/delta/best-practices
Databricks recommends frequently running the OPTIMIZE command to compact small files.
Please try to run this:
-- For a non-partitioned table
OPTIMIZE my_delta_table;
-- For a table partitioned by 'date', optimizing the last 2 days
OPTIMIZE my_delta_table WHERE date >= current_date() - INTERVAL 2 DAY;
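If you prefer to run this from a scheduled notebook or job instead of SQL, a rough sketch (the table name is a placeholder):

# Compact the whole table, or only the recent partitions of a partitioned table.
spark.sql("OPTIMIZE my_delta_table")
spark.sql("""
    OPTIMIZE my_delta_table
    WHERE date >= current_date() - INTERVAL 2 DAY
""")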
I am waiting for others in the forum to confirm this solution. @VaderKB I assume this is not a production environment, right?
07-28-2025 01:33 AM
Hello @VaderKB
Did it resolve the issue?
07-29-2025 12:18 PM
@Khaja_Zaffer, thank you for your answer. Yes, this is a production environment. Running OPTIMIZE did reduce the latency, but I don't understand why. From what I understand, OPTIMIZE compacts small files into larger ones, and the documentation says this makes reads faster because there are fewer files to read; it says nothing about writes. For writes in append mode I would expect it not to matter, since new files are simply added on top of the existing ones, so the number of files that already exist should be irrelevant. I am not sure how it is implemented under the hood, but if you know why, please let me know.
07-29-2025 09:17 PM
Yes, too many small Parquet files in a Delta table can degrade write performance as well: each streaming commit has to process the table's transaction log, and that metadata work grows with the number of files, so addBatch and commitBatch take longer. Regularly running OPTIMIZE reduces this overhead and improves streaming latency.
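One way to see this is to watch the file count grow between OPTIMIZE runs; a rough sketch, with the table name as a placeholder:

# DESCRIBE DETAIL reports, among other things, the current number of files.
spark.sql("DESCRIBE DETAIL my_delta_table") \
    .select("numFiles", "sizeInBytes") \
    .show()
# If numFiles keeps climbing between OPTIMIZE runs, growing addBatch and
# commitBatch times are the expected symptom.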
07-30-2025 01:23 AM
Hello @VaderKB
You're right that OPTIMIZE primarily speeds up reads by reducing the number of files. The append itself writes new files regardless of what is already in the table, but each streaming commit also has to update and periodically checkpoint the Delta transaction log, and that bookkeeping grows with the number of files the table holds. After OPTIMIZE there are far fewer live files, so the per-commit metadata work shrinks, which is why your addBatch and commitBatch durations dropped.
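You can also see the per-commit activity directly in the table history; a minimal sketch, with the table name as a placeholder:

# Each micro-batch appears as a STREAMING UPDATE commit with its own metrics.
(spark.sql("DESCRIBE HISTORY my_delta_table")
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(10, truncate=False))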
You can find more detail in this document:
https://www.databricks.com/discover/pages/optimize-data-workloads-guide
As the issue is resolved, can you please close the thread by selecting a solution on this page?
Thank you.