Delta solves the small-file problem (a large number of small files) using the following operations available on a Delta table.
Optimized writes improve the write operation by adding an extra shuffle step before the write, reducing the number of output files. By default, the target file size is on the order of 128 MB, which prevents very small files from being created during the write.
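A minimal sketch of enabling optimized writes, assuming a Databricks runtime where these configurations are available (the table name sales_events is hypothetical):

    # Enable optimized writes for the whole session
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    # Or enable it per table via a table property (hypothetical table name)
    spark.sql("""
        ALTER TABLE sales_events
        SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
    """)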
Auto-compaction helps compact small files after a write. Although optimized writes help create larger files, a write operation may not have enough data to produce 128 MB files. This usually happens in streaming jobs, where the data arriving in a micro-batch can end up creating smaller files. Auto-compaction kicks in once a table directory (or table partition directory) contains at least 50 small files; these default thresholds can be modified. Auto-compaction runs as a post-commit hook.
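A sketch of turning on auto-compaction and adjusting the file-count trigger, assuming the standard Databricks Delta configuration names; defaults can differ between runtime versions, so treat the values as illustrative:

    # Enable auto-compaction for the session
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

    # Lower the trigger from the default of 50 small files to 20
    # (assumption: this knob is exposed in your Databricks runtime)
    spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "20")

    # Or enable it per table via a table property (hypothetical table name)
    spark.sql("""
        ALTER TABLE sales_events
        SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)
    """)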
Last but not least is bin packing, i.e. the regular OPTIMIZE operation. The OPTIMIZE command bin-packs the data from many small files into larger ones; by default, the output file size is on the order of 1 GB. OPTIMIZE takes an optional clause naming the columns on which data co-locality should be ensured. This is referred to as Z-ORDER.
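A minimal sketch of running a scheduled bin-packing job, with and without Z-ordering; the table name and column are hypothetical:

    # Bin-pack small files in the table into ~1 GB files
    spark.sql("OPTIMIZE sales_events")

    # Bin-pack and co-locate related rows on a frequently filtered column
    spark.sql("OPTIMIZE sales_events ZORDER BY (event_date)")

Z-ordering is most useful on columns that appear in query predicates, since co-locating related values lets data skipping prune more files at read time.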
These optimizations only apply once the data has already been written into the Delta table, correct? So the one-time initial load of the many small source files will still have poor performance, correct?