topic Re: Does too many parquet files in delta table impact writes for the streaming job in Data Engineering

Does too many parquet files in delta table impact writes for the streaming job

VaderKB — Tue, 22 Jul 2025 09:53:18 GMT

Hello,

I am running a spark streaming job that reads data from AWS Kinesis and writes data to extrenal delta tables which are stored in S3. But I have noticed that over the time, the latency has been increasing. I also noticed that for each batch, the addBatch and commitBatch time has been increasing. I am writing to table in append mode.

I did run an OPTIMISE and my latency improved along with reductions in addBatch and commitBatch duration.

I know that too many small files reduces read performance, but my question is does too many small streaming parquet files in delta table impact writes for the streaming job? But

Re: Does too many parquet files in delta table impact writes for the streaming job

Khaja_Zaffer — Tue, 22 Jul 2025 11:29:59 GMT

But? please share your whole query.

Re: Does too many parquet files in delta table impact writes for the streaming job

VaderKB — Tue, 22 Jul 2025 12:15:42 GMT

@Khaja_Zaffer Sorry the "But" is a typo. The query is simple: I read using readstream, extract the data and expand the json into a table structure and write the data back using writestream with append mode ,checkpoints and mergeSchema as True. It is something like:

(data .writeStream .outputMode("append") .format("delta") .queryName("query_name") .option("checkpointLocation", checkpoint_location) .option("header", "true") .option("mergeSchema", "true") .toTable(table_name) )

I cannot share the complete query, but it is quite straightforward. And it becomes slower and slower as the table size grows.

Re: Does too many parquet files in delta table impact writes for the streaming job

Khaja_Zaffer — Tue, 22 Jul 2025 18:36:57 GMT

Understood. @VaderKB

Please check this documentation around this : https://docs.databricks.com/aws/en/delta/best-practices
Databricks recommends frequently running the OPTIMIZE command to compact small files.

Please try to run this:

-- For a non-partitioned table
OPTIMIZE my_delta_table;

-- For a table partitioned by 'date', optimizing the last 2 days
OPTIMIZE my_delta_table WHERE date >= current_date() - INTERVAL 2 DAY;

I am waiting other in the forum to confirm this solution. @VaderKB I am sure this is not production environment right?

Re: Does too many parquet files in delta table impact writes for the streaming job

Khaja_Zaffer — Mon, 28 Jul 2025 08:33:42 GMT

Hello @VaderKB

Did it resolve the issue?

Re: Does too many parquet files in delta table impact writes for the streaming job

VaderKB — Tue, 29 Jul 2025 19:18:19 GMT

@Khaja_Zaffer , Thank you for your answer. Yes this is in production environment. Doing an OPTIMIZE did reduce the latency. But I don't understand why? Because from what I understand, OPTIMIZE compacts the file into larger files. And from what I read in the documentation, it should make reads faster as there would be fewer files to read. But there is nothing on writes. However in case of writes with append mode, it should not be the case as the files are just added in top of each other. So it should not matter how many files already exists because we are just adding more. Not sure how it is implemented under the hood. But if you know why, please do let me know.

Re: Does too many parquet files in delta table impact writes for the streaming job

jerrygen78 — Wed, 30 Jul 2025 04:17:07 GMT

Yes, too many small Parquet files in a Delta table can degrade write performance by increasing metadata overhead during commits. Regularly running OPTIMIZE helps reduce this impact and improve streaming latency.

Re: Does too many parquet files in delta table impact writes for the streaming job

Khaja_Zaffer — Wed, 30 Jul 2025 08:23:27 GMT

Hello @VaderKB

You're right that OPTIMIZE makes reads faster by reducing the number of files. For writes using append mode, it doesn't directly speed up the operation itself. However, having fewer, larger files from a previous OPTIMIZE run can improve the overall performance of subsequent reads that might be part of a larger job, which could make the entire pipeline seem faster.

You can go in detail from this document
https://www.databricks.com/discover/pages/optimize-data-workloads-guide

as the issue is resolved can you please close the case by selecting a solution from this page?

Thank you.