Best practice for optimizedWrites and OPTIMIZE
06-23-2021 02:16 PM
What is the best practice for a very high-throughput Delta pipeline to avoid the small files problem and also reduce the need to run an external OPTIMIZE frequently?
- Labels:
- Delta Pipeline
- Values
06-23-2021 06:28 PM
A better way I can think of is:
Enable optimized writes (it aims to write files of about 128 MB).
Enable auto compaction.
delta.autoOptimize.optimizeWrite = true
delta.autoOptimize.autoCompact = true
Complete guide:
https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize
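As a rough sketch of how the two properties above might be applied to an existing table, the helper below only builds an `ALTER TABLE ... SET TBLPROPERTIES` statement as a string. The table name `events` and the function name `auto_optimize_ddl` are made up for illustration, and actually running the statement assumes a Databricks cluster with a SparkSession called `spark`.

```python
def auto_optimize_ddl(table: str, optimize_write: bool = True, auto_compact: bool = True) -> str:
    """Build an ALTER TABLE statement that sets the two auto optimize table properties."""
    props = {
        "delta.autoOptimize.optimizeWrite": optimize_write,
        "delta.autoOptimize.autoCompact": auto_compact,
    }
    # Render Python booleans as the lowercase literals used in SQL.
    rendered = ", ".join(f"{key} = {str(value).lower()}" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({rendered})"


# `events` is a hypothetical table name.
stmt = auto_optimize_ddl("events")
print(stmt)
# On a cluster this could then be executed with: spark.sql(stmt)
```

Passing `auto_compact=False` would produce the variant with auto compaction turned off.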
06-23-2021 08:35 PM
As Kunal mentioned, delta.autoOptimize.optimizeWrite aims to create 128 MB files. If you have very high write throughput and need low-latency inserts, consider disabling auto compaction by setting "delta.autoOptimize.autoCompact = false".
This pattern is convenient if the table is partitioned by day and the pipeline is append-heavy: you can run a manual OPTIMIZE with a filter condition that excludes the current day, which reduces write conflicts.
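To make the filter idea concrete, here is a minimal sketch that builds an `OPTIMIZE ... WHERE` statement covering only partitions older than a given day. The table name `events`, the partition column `event_date`, and the helper name `optimize_before_today` are assumptions for illustration; the thread does not name the actual table or column.

```python
from datetime import date


def optimize_before_today(table: str, partition_col: str, today: date) -> str:
    """Build an OPTIMIZE statement restricted to partitions strictly before `today`."""
    # The predicate must reference the partition column so the current day is skipped.
    return f"OPTIMIZE {table} WHERE {partition_col} < '{today.isoformat()}'"


# Hypothetical table and column names; the date is fixed here only for the example.
stmt = optimize_before_today("events", "event_date", date(2021, 6, 23))
print(stmt)
# On a cluster this could then be executed with: spark.sql(stmt)
```

In a scheduled job, `date.today()` would normally replace the fixed date, so that the partition still receiving appends is left alone.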
06-23-2021 10:21 PM
The general practice is to enable only optimized writes and to disable auto compaction. Optimized writes already introduce an extra shuffle step, which increases the latency of the write operation. Auto compaction adds further latency to the write, specifically in the commit operation. For that reason, the common practice is to leave auto compaction off and run the OPTIMIZE command on a daily basis instead.
05-20-2025 08:36 AM
Hi all,
Can anyone who has solved this challenge confirm whether the below increases write latency and avoids creating smaller files? Based on a POC I did, I could not reproduce that behaviour, so I am just wondering. Many thanks.