Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Too many small files from updates

pradeepvatsvk
New Contributor II

Hi,

I am updating data in a Delta table, and each time I only need to update a single row. As a result, every UPDATE statement creates a new file. How do I tackle this issue? It doesn't make sense to run an OPTIMIZE command after every update.
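For context, this is the pattern being described; the sketch below is only an illustration, assuming PySpark and placeholder catalog, table, and column names. Each such statement rewrites the data file containing the matched row, which is what leaves a new small file behind after every update.

# Illustration only: a single-row UPDATE against a Delta table (placeholder names).
# Each statement rewrites the file holding the matched row, producing a new small file.
spark.sql("""
    UPDATE my_catalog.my_schema.events
    SET status = 'processed'
    WHERE event_id = 12345
""")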

6 REPLIES

JakubSkibicki
Contributor

Usually this problem is solved with the auto optimize properties.

https://docs.databricks.com/en/delta/tune-file-size.html#auto-compaction-for-delta-lake-on-databrick...

For managed tables this option is enabled by default.
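As a sketch of what the linked doc describes, the behavior can also be enabled per table through table properties; the table name below is a placeholder.

# Sketch only: enable optimized writes and auto compaction on a single table.
# Adjust the placeholder name to your catalog/schema/table.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")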

 

Rjdudley
Valued Contributor II

Depending on your table settings, those may be log files or versions kept for Time Travel. Unless you've mastered partitioning, you really shouldn't worry about the files; let the system do what it does.

But it is causing a performance hit; every update command takes longer than the previous one, since it has to filter through more files.

 

saurabh18cs
Valued Contributor III

Set the following Spark session properties and give it a try:

 

'spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite': 'true'
'spark.databricks.delta.optimizeWrite.enabled': 'true'
'spark.sql.shuffle.partitions': 'auto'
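For example, a minimal way to apply these settings in a notebook before running the updates, assuming a PySpark spark session is available:

# Apply the suggested session-level defaults before running the updates.
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "auto")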

Rjdudley
Valued Contributor II

OK, something isn't right; I work with massive datasets and this is not an issue for a single update. If your architecture and Unity Catalog configuration are correct, and there isn't some weird bug, you should not even be aware of the underlying files. Are you working against the data files directly, or are you querying the tables through Unity Catalog?

Lakshay
Databricks Employee

If you are performing hundreds of update operations on the Delta table, you can opt to run an OPTIMIZE operation after each batch of 100 updates. There should be no significant performance issue for up to 100 such updates.
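A minimal sketch of that batching pattern, assuming a hypothetical table name and a driver-side iterable rows_to_update supplying the values:

# Sketch: run single-row updates and compact after every 100 of them.
# Table name and rows_to_update are placeholders for illustration.
for i, row in enumerate(rows_to_update, start=1):
    spark.sql(
        f"UPDATE my_catalog.my_schema.events "
        f"SET status = '{row.status}' WHERE event_id = {row.event_id}"
    )
    if i % 100 == 0:
        # OPTIMIZE compacts the small files produced by the preceding updates.
        spark.sql("OPTIMIZE my_catalog.my_schema.events")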
