<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic best practice for optimizedWrites and Optimize in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21210#M14427</link>
    <description>&lt;P&gt;What is the best practice for a Delta pipeline with very high throughput to avoid the small-files problem and also reduce the need to run an external OPTIMIZE frequently?&lt;/P&gt;</description>
    <pubDate>Wed, 23 Jun 2021 21:16:18 GMT</pubDate>
    <dc:creator>User16783853501</dc:creator>
    <dc:date>2021-06-23T21:16:18Z</dc:date>
    <item>
      <title>best practice for optimizedWrites and Optimize</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21210#M14427</link>
      <description>&lt;P&gt;What is the best practice for a Delta pipeline with very high throughput to avoid the small-files problem and also reduce the need to run an external OPTIMIZE frequently?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Jun 2021 21:16:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21210#M14427</guid>
      <dc:creator>User16783853501</dc:creator>
      <dc:date>2021-06-23T21:16:18Z</dc:date>
    </item>
    <item>
      <title>Re: best practice for optimizedWrites and Optimize</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21211#M14428</link>
      <description>&lt;P&gt;A better approach I can think of is:&lt;/P&gt;&lt;P&gt;Enable optimized writes (it will automatically target files of around 128 MB).&lt;/P&gt;&lt;P&gt;Enable auto compaction.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;delta.autoOptimize.optimizeWrite = true
delta.autoOptimize.autoCompact = true&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Complete guide:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jun 2021 01:28:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21211#M14428</guid>
      <dc:creator>User16826994223</dc:creator>
      <dc:date>2021-06-24T01:28:02Z</dc:date>
    </item>
    <item>
      <title>Re: best practice for optimizedWrites and Optimize</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21212#M14429</link>
      <description>&lt;P&gt;As Kunal mentioned, delta.autoOptimize.optimizeWrite aims to create 128 MB files. If you have very high write throughput and need low-latency inserts, consider disabling auto compaction by setting "delta.autoOptimize.autoCompact = false".&lt;/P&gt;&lt;P&gt;This pattern is convenient if you have the table partitioned by day and an append-heavy pipeline: you could run a manual OPTIMIZE with a filter condition that excludes the current day to reduce write conflicts.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jun 2021 03:35:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21212#M14429</guid>
      <dc:creator>sajith_appukutt</dc:creator>
      <dc:date>2021-06-24T03:35:30Z</dc:date>
    </item>
    <item>
      <title>Re: best practice for optimizedWrites and Optimize</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21213#M14430</link>
      <description>&lt;P&gt;The general practice is to enable only optimized writes and disable auto-compaction. Optimized writes introduce an extra shuffle step, which increases the latency of the write operation, and auto-compaction adds further latency to the write, specifically in the commit operation. Running an OPTIMIZE command on a daily basis is therefore a common practice.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jun 2021 05:21:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/21213#M14430</guid>
      <dc:creator>brickster_2018</dc:creator>
      <dc:date>2021-06-24T05:21:44Z</dc:date>
    </item>
    <item>
      <title>Re: best practice for optimizedWrites and Optimize</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/119786#M45972</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;Can anyone who has solved this challenge confirm whether the settings below increase write latency and avoid creating smaller files? Based on a POC I did, I don't see that behaviour replicated, so I am just wondering. Many thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 15:36:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practice-for-optimizedwrites-and-optimize/m-p/119786#M45972</guid>
      <dc:creator>rajkve</dc:creator>
      <dc:date>2025-05-20T15:36:31Z</dc:date>
    </item>
  </channel>
</rss>

