Data Engineering

Do too many Parquet files in a Delta table impact writes for a streaming job?

VaderKB
New Contributor II

Hello,

I am running a Spark streaming job that reads data from AWS Kinesis and writes it to external Delta tables stored in S3. I have noticed that over time the latency has been increasing, and that for each batch the addBatch and commitBatch durations have been growing. I am writing to the table in append mode.
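
For reference, this is roughly how I am tracking the per-batch durations; query here stands for the StreamingQuery handle returned by the writeStream shown further down, and the exact keys under durationMs (addBatch, walCommit, etc.) depend on the Spark/DBR version:

# Rough sketch of how I watch the batch durations.
# 'query' is the StreamingQuery handle returned by writeStream ... .toTable(...).
# The keys under "durationMs" (e.g. "addBatch", "walCommit") vary by Spark version.
progress = query.lastProgress
if progress is not None:
    print(progress["batchId"], progress["durationMs"])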

I did run an OPTIMIZE and my latency improved, along with reductions in the addBatch and commitBatch durations.

I know that too many small files reduce read performance, but my question is: do too many small streaming Parquet files in a Delta table impact writes for the streaming job? But 
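
For context, this is a rough way to see how many active files the table has accumulated (my_delta_table is a placeholder name; DESCRIBE DETAIL reports the active file count and total size of a Delta table):

# Placeholder table name. DESCRIBE DETAIL reports the number of active data
# files and the total size of a Delta table.
detail = spark.sql("DESCRIBE DETAIL my_delta_table").collect()[0]
print(f"numFiles: {detail['numFiles']}, sizeInBytes: {detail['sizeInBytes']}")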

7 REPLIES

Khaja_Zaffer
Contributor

But? Please share your whole query. 

VaderKB
New Contributor II

@Khaja_Zaffer Sorry, the "But" is a typo. The query is simple: I read with readStream, extract the data and expand the JSON into a table structure, then write it back with writeStream in append mode, with a checkpoint and mergeSchema set to true. It is something like:

(data
    .writeStream
    .outputMode("append")                                # append-only writes
    .format("delta")
    .queryName("query_name")
    .option("checkpointLocation", checkpoint_location)   # streaming checkpoint location in S3
    .option("header", "true")                            # CSV option; has no effect on a Delta sink
    .option("mergeSchema", "true")                       # allow schema evolution on write
    .toTable(table_name)
)

I cannot share the complete query, but it is quite straightforward. And it becomes slower and slower as the table size grows.
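
The read side looks roughly like this; the stream name, region, and JSON schema below are placeholders, the real query expands many more columns, and the option names are those of the Databricks Kinesis source as I recall them, so please check the connector docs for your runtime:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema for the JSON payload; the real one is much wider.
payload_schema = StructType([
    StructField("id", StringType()),
    StructField("event_type", StringType()),
])

data = (
    spark.readStream
    .format("kinesis")                     # Databricks Kinesis source
    .option("streamName", "my-stream")     # placeholder stream name
    .option("region", "us-east-1")         # placeholder region
    .option("initialPosition", "latest")
    .load()
    # Kinesis delivers the record payload as binary; cast and expand the JSON
    .select(from_json(col("data").cast("string"), payload_schema).alias("payload"))
    .select("payload.*")
)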

Khaja_Zaffer
Contributor

Understood. @VaderKB 

Please check this documentation: https://docs.databricks.com/aws/en/delta/best-practices
Databricks recommends frequently running the OPTIMIZE command to compact small files.

Please try to run this: 

-- For a non-partitioned table
OPTIMIZE my_delta_table;

-- For a table partitioned by 'date', optimizing the last 2 days
OPTIMIZE my_delta_table WHERE date >= current_date() - INTERVAL 2 DAY;

I am waiting for others in the forum to confirm this solution. @VaderKB, I assume this is not a production environment, right?
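
If it is easier to run this from a scheduled notebook or job instead of SQL, the equivalent calls are simply the following (the table name is a placeholder):

# Compact small files; same effect as the SQL above. Run this on a schedule
# (e.g. hourly or daily) rather than inside the streaming query itself.
spark.sql("OPTIMIZE my_delta_table")

# Or limit it to recent partitions if the table is partitioned by 'date':
spark.sql("OPTIMIZE my_delta_table WHERE date >= current_date() - INTERVAL 2 DAY")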

Khaja_Zaffer
Contributor

Hello @VaderKB 

Did it resolve the issue?

VaderKB
New Contributor II

@Khaja_Zaffer Thank you for your answer. Yes, this is a production environment. Running OPTIMIZE did reduce the latency, but I don't understand why. From what I understand, OPTIMIZE compacts the files into larger ones, and from what I read in the documentation this should make reads faster because there are fewer files to read; it says nothing about writes. For writes in append mode it should not matter, since new files are just added on top of the existing ones, so the number of files that already exist should not affect how many we add. I am not sure how it is implemented under the hood, but if you know why, please do let me know.
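
For what it's worth, this is roughly how I check how many files each streaming commit adds; the operationMetrics keys shown here are what I see on my runtime and may differ slightly on others:

# Show recent commits to the table and how many files each one added.
# operationMetrics is a map column; key names such as "numAddedFiles" and
# "numOutputRows" may vary between runtime versions.
history = spark.sql("DESCRIBE HISTORY my_delta_table LIMIT 20")
(history
    .selectExpr(
        "version",
        "timestamp",
        "operation",
        "operationMetrics['numAddedFiles'] AS files_added",
        "operationMetrics['numOutputRows'] AS rows_written",
    )
    .show(truncate=False))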

jerrygen78
New Contributor III

Yes, too many small Parquet files in a Delta table can degrade write performance by increasing metadata overhead during commits. Regularly running OPTIMIZE helps reduce this impact and improve streaming latency.

Khaja_Zaffer
Contributor

Hello @VaderKB 

You're right that OPTIMIZE makes reads faster by reducing the number of files. For writes using append mode, it doesn't directly speed up the operation itself. However, having fewer, larger files from a previous OPTIMIZE run can improve the overall performance of subsequent reads that might be part of a larger job, which could make the entire pipeline seem faster.

You can find more detail in this document:
https://www.databricks.com/discover/pages/optimize-data-workloads-guide
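
That guide also covers optimized writes and auto compaction, which reduce how many small files a streaming append creates in the first place. If your runtime supports them, enabling them on the table looks roughly like this (property names as I recall them; please verify against the guide):

# Reduce small-file creation at write time. Property names are as documented
# for Databricks Delta; verify them against the linked guide for your runtime.
spark.sql("""
    ALTER TABLE my_delta_table SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")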

As the issue is resolved, could you please close the thread by selecting a solution on this page?

 

Thank you. 

 
