06-23-2021 06:34 AM
Delta creates a large number of small files during MERGE and UPDATE operations.
Labels: Delta, Large Number, Small Files
Accepted Solutions
06-23-2021 06:45 AM
Delta addresses the small-files problem using the following operations, available for any Delta table.
- Optimized writes optimize the write operation by adding an extra shuffle step, reducing the number of output files. By default, the target file size is on the order of 128 MB, which prevents very small files from being created during the write.
- Auto-compaction compacts small files. Although optimized writes help create larger files, a write operation may not have enough data to produce 128 MB files; this typically happens in streaming jobs, where the data arriving in a micro-batch can end up creating smaller files. Auto-compaction kicks in once a table directory (or table partition directory) has 50 small files; these defaults can be modified. It is triggered as a post-commit hook. (A minimal configuration sketch follows this list.)
- Last but not least is bin packing, i.e. the regular OPTIMIZE operation. The OPTIMIZE command bin-packs the data from many small files into a single file; the output file size is on the order of 1 GB by default. OPTIMIZE takes an optional list of columns on which co-locality should be ensured; this is referred to as Z-ORDER. (See the example after the links below.)
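Here is a minimal PySpark sketch of enabling both behaviors, assuming a Databricks runtime with Delta Lake; the table name `events` is a placeholder, and the property/config names follow the auto-optimize documentation linked below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-table: turn on optimized writes and auto-compaction for one Delta table
# ("events" is a placeholder name).
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Session-wide alternative: apply to every Delta write in this Spark session.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```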
Read more here:
https://docs.databricks.com/delta/optimizations/auto-optimize.html
https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html
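And a sketch of a scheduled bin-packing pass; again, `events` is a placeholder table, `event_id` is a hypothetical high-cardinality column used for Z-ordering, and the `date` partition column in the filtered variant is an assumption:

```python
# Rewrite small files into ~1 GB files; ZORDER BY co-locates rows on the
# given column(s) so selective reads can skip more files.
spark.sql("OPTIMIZE events ZORDER BY (event_id)")

# Optionally limit the rewrite to recent partitions to control cost
# (assumes the table is partitioned by a "date" column).
spark.sql("OPTIMIZE events WHERE date >= '2021-06-01' ZORDER BY (event_id)")
```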
02-13-2024 09:57 AM
This doesn't help until after you've already loaded the original file set, correct? The one-time initial load of the many small files will still have poor performance/speed, correct?

