07-22-2025 02:52 AM - edited 07-22-2025 02:53 AM
Hello,
I am running a Spark Structured Streaming job that reads data from AWS Kinesis and writes to external Delta tables stored in S3. I have noticed that over time the latency has been increasing, and that the addBatch and commitBatch durations grow with each batch. I am writing to the table in append mode.
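(For reference, these per-batch timings can be read from the streaming query's progress; a minimal sketch, assuming query is the handle returned when the stream is started:)

# query.lastProgress is a plain dict describing the most recent micro-batch.
progress = query.lastProgress
if progress is not None:
    # durationMs holds the per-phase timings, including addBatch.
    print(progress["batchId"], progress["durationMs"])

# recentProgress shows the trend across the last few batches.
for p in query.recentProgress:
    print(p["batchId"], p["durationMs"].get("addBatch"))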
I did run an OPTIMIZE and my latency improved, along with a reduction in the addBatch and commitBatch durations.
I know that too many small files reduce read performance, but my question is: do too many small streaming Parquet files in a Delta table also impact writes for the streaming job? But
07-22-2025 04:29 AM
But? Please share your whole query.
07-22-2025 04:55 AM - edited 07-22-2025 05:15 AM
@Khaja_Zaffer Sorry, the "But" is a typo. The query is simple: I read using readStream, extract the data, expand the JSON into a table structure, and write the data back using writeStream with append mode, checkpointing, and mergeSchema set to true. It is something like:
(data
.writeStream
.outputMode("append")
.format("delta")
.queryName("query_name")
.option("checkpointLocation", checkpoint_location)
.option("header", "true")
.option("mergeSchema", "true")
.toTable(table_name)
)
I cannot share the complete query, but it is quite straightforward. And it becomes slower and slower as the table size grows.
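For completeness, the read side is roughly this; a minimal sketch rather than the real job, and the stream name, region, and payload schema below are just placeholders:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder schema for the JSON payload -- the real one is wider.
payload_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("body", StringType()),
])

raw = (spark.readStream
    .format("kinesis")                  # Databricks Kinesis connector
    .option("streamName", "my-stream")  # placeholder stream name
    .option("region", "us-east-1")      # placeholder region
    .load())

# Kinesis delivers the record payload as binary in the "data" column;
# cast it to string and expand the JSON into columns.
data = (raw
    .select(from_json(col("data").cast("string"), payload_schema).alias("payload"))
    .select("payload.*"))

The writeStream shown above then consumes this data DataFrame.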
07-22-2025 11:36 AM
Understood. @VaderKB
Please check this documentation: https://docs.databricks.com/aws/en/delta/best-practices
Databricks recommends frequently running the OPTIMIZE command to compact small files.
Please try to run this:
-- For a non-partitioned table
OPTIMIZE my_delta_table;
-- For a table partitioned by 'date', optimizing the last 2 days
OPTIMIZE my_delta_table WHERE date >= current_date() - INTERVAL 2 DAY;
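If you prefer to run this from a scheduled notebook or job instead of SQL, a rough sketch (the table name is a placeholder):

# Compact the whole table, or only the recent partitions of a partitioned table.
spark.sql("OPTIMIZE my_delta_table")
spark.sql("""
    OPTIMIZE my_delta_table
    WHERE date >= current_date() - INTERVAL 2 DAY
""")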
I am waiting for others in the forum to confirm this solution. @VaderKB I assume this is not a production environment, right?
07-28-2025 01:33 AM
Hello @VaderKB
Did it resolve the issue?
07-29-2025 12:18 PM
@Khaja_Zaffer, thank you for your answer. Yes, this is a production environment. Running OPTIMIZE did reduce the latency, but I don't understand why. From what I understand, OPTIMIZE compacts small files into larger ones, and the documentation says this makes reads faster because there are fewer files to read; it says nothing about writes. For writes in append mode I would expect it not to matter, since new files are simply added on top of the existing ones, so the number of files that already exist should be irrelevant. I am not sure how it is implemented under the hood, but if you know why, please let me know.
07-29-2025 09:17 PM
Yes, too many small Parquet files in a Delta table can degrade write performance as well: each streaming commit has to process the table's transaction log, and that metadata work grows with the number of files, so addBatch and commitBatch take longer. Regularly running OPTIMIZE reduces this overhead and improves streaming latency.
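One way to see this is to watch the file count grow between OPTIMIZE runs; a rough sketch, with the table name as a placeholder:

# DESCRIBE DETAIL reports, among other things, the current number of files.
spark.sql("DESCRIBE DETAIL my_delta_table") \
    .select("numFiles", "sizeInBytes") \
    .show()
# If numFiles keeps climbing between OPTIMIZE runs, growing addBatch and
# commitBatch times are the expected symptom.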
07-30-2025 01:23 AM
Hello @VaderKB
You're right that OPTIMIZE primarily speeds up reads by reducing the number of files. The append itself writes new files regardless of what is already in the table, but each streaming commit also has to update and periodically checkpoint the Delta transaction log, and that bookkeeping grows with the number of files the table holds. After OPTIMIZE there are far fewer live files, so the per-commit metadata work shrinks, which is why your addBatch and commitBatch durations dropped.
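You can also see the per-commit activity directly in the table history; a minimal sketch, with the table name as a placeholder:

# Each micro-batch appears as a STREAMING UPDATE commit with its own metrics.
(spark.sql("DESCRIBE HISTORY my_delta_table")
    .select("version", "timestamp", "operation", "operationMetrics")
    .show(10, truncate=False))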
You can find more detail in this document:
https://www.databricks.com/discover/pages/optimize-data-workloads-guide
As the issue is resolved, can you please close the thread by selecting a solution on this page?
Thank you.