Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Too many small files from updates

pradeepvatsvk
New Contributor II

Hi,

I am updating data in a Delta table, and each time I only need to update a single row. As a result, every UPDATE statement creates a new file. How do I tackle this issue? It doesn't make sense to run an OPTIMIZE command after every update.
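For context, this is the pattern being described; the sketch below is only an illustration, assuming PySpark and placeholder catalog, table, and column names. Each such statement rewrites the data file containing the matched row, which is what leaves a new small file behind after every update.

# Illustration only: a single-row UPDATE against a Delta table (placeholder names).
# Each statement rewrites the file holding the matched row, producing a new small file.
spark.sql("""
    UPDATE my_catalog.my_schema.events
    SET status = 'processed'
    WHERE event_id = 12345
""")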

6 REPLIES

JakubSkibicki
Contributor

Usually this problem is solved with the auto optimize properties.

https://docs.databricks.com/en/delta/tune-file-size.html#auto-compaction-for-delta-lake-on-databrick...

For managed tables this option is enabled by default.
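As a sketch of what the linked doc describes, the behavior can also be enabled per table through table properties; the table name below is a placeholder.

# Sketch only: enable optimized writes and auto compaction on a single table.
# Adjust the placeholder name to your catalog/schema/table.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")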

 

Rjdudley
Valued Contributor II

Depending on your table settings, those may be log files or versions kept for Time Travel. Unless you've mastered partitioning, you really shouldn't worry about the files; let the system do what it does.

But it is causing a performance hit; every update command takes longer than the previous one, since it has to filter through more files.

 

saurabh18cs
Valued Contributor III

Set the following Spark session properties and give it a try:

 

'spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite': 'true'
'spark.databricks.delta.optimizeWrite.enabled': 'true'
'spark.sql.shuffle.partitions': 'auto'
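For example, a minimal way to apply these settings in a notebook before running the updates, assuming a PySpark spark session is available:

# Apply the suggested session-level defaults before running the updates.
spark.conf.set("spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite", "true")
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "auto")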

Rjdudley
Valued Contributor II

OK, something isn't right; I work with massive datasets and this is not an issue for a single update. If your architecture and Unity Catalog configuration are correct, and there isn't some weird bug, you should not even be aware of the underlying files. Are you working against the data files directly, or are you querying the tables through Unity Catalog?

Lakshay
Databricks Employee

If you are performing hundreds of update operations on the Delta table, you can opt to run an OPTIMIZE operation after each batch of 100 updates. There should be no significant performance issue for up to 100 such updates.
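A minimal sketch of that batching pattern, assuming a hypothetical table name and a driver-side iterable rows_to_update supplying the values:

# Sketch: run single-row updates and compact after every 100 of them.
# Table name and rows_to_update are placeholders for illustration.
for i, row in enumerate(rows_to_update, start=1):
    spark.sql(
        f"UPDATE my_catalog.my_schema.events "
        f"SET status = '{row.status}' WHERE event_id = {row.event_id}"
    )
    if i % 100 == 0:
        # OPTIMIZE compacts the small files produced by the preceding updates.
        spark.sql("OPTIMIZE my_catalog.my_schema.events")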
