Data Engineering
Best practice for optimized writes and OPTIMIZE

User16783853501
New Contributor II

What is the best practice for a Delta pipeline with very high throughput to avoid the small-files problem and also reduce how often an external OPTIMIZE needs to run?

3 REPLIES

User16826994223
Honored Contributor III

A better approach I can think of is:

Enable optimized writes (it will automatically target files of about 128 MB).

Enable auto compaction.

delta.autoOptimize.optimizeWrite = true
delta.autoOptimize.autoCompact = true

Complete guide:

https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize
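For an existing table, these settings can also be applied as table properties; a minimal sketch, assuming a Delta table named `events` (the table name is a placeholder):

```sql
-- Enable optimized writes and auto compaction on an existing Delta table
-- (`events` is a hypothetical table name)
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);
```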

sajith_appukutt
Honored Contributor II

As Kunal mentioned, delta.autoOptimize.optimizeWrite aims to create ~128 MB files. If you have very high write throughput and need low-latency inserts, consider disabling auto compaction by setting delta.autoOptimize.autoCompact = false.

This pattern is convenient if the table is partitioned by day and the pipeline is append-heavy: you can run a manual OPTIMIZE with a filter condition that excludes the current day, which reduces write conflicts with the incoming appends.
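For example, a daily job could compact only the closed partitions; a sketch, assuming a table `events` partitioned by a `date` column (both names hypothetical):

```sql
-- Compact only partitions before today, leaving the hot partition
-- untouched to avoid conflicting with concurrent appends
OPTIMIZE events WHERE date < current_date();
```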

User16869510359
Esteemed Contributor

The general practice is to enable only optimized writes and disable auto compaction. Optimized writes introduce an extra shuffle step that increases the latency of the write operation, and auto compaction adds further latency on top of that, specifically in the commit operation. So the common approach is to run an OPTIMIZE command on a daily basis instead.
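That setup could be expressed as session configuration plus a scheduled compaction job; a sketch, assuming the Spark conf names documented in the Databricks auto-optimize guide and a hypothetical table `events`:

```sql
-- Enable optimized writes for all writes in this session,
-- but leave auto compaction off to keep commit latency low
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = false;

-- Run compaction out-of-band instead, e.g. from a daily scheduled job
OPTIMIZE events;
```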