How does Delta solve the large number of small file problems?

User16869510359
Esteemed Contributor

Delta creates many small files during merge and update operations.


2 REPLIES

User16869510359
Esteemed Contributor

Delta addresses the small-files problem with the following operations, which are available for a Delta table.

  • Optimized writes - optimize the write operation by adding an additional shuffle step and reducing the number of output files. By default, the target file size is on the order of 128 MB, which prevents very small files from being created during the write.
  • Auto compaction - compacts small files after a write. Although optimized writes help create larger files, a write operation may not have enough data to produce 128 MB files. This usually happens with streaming jobs, where the data arriving in a micro-batch can end up creating smaller files. Auto compaction kicks in once a table directory (or table partition directory) has 50 small files; this default can be changed (see the configuration sketch after this list). Auto compaction is triggered as a post-commit hook.
  • Last but not least, bin packing via the regular OPTIMIZE operation. The OPTIMIZE command bin-packs the data from many small files into larger files, with a default output file size on the order of 1 GB. OPTIMIZE takes an optional clause naming the columns on which co-locality should be ensured; this is referred to as Z-ORDERing (see the command sketch after this list).
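For reference, here is a minimal sketch (PySpark, assuming a Databricks notebook where `spark` is predefined) of enabling optimized writes and auto compaction, either per table or per session. The table name `events` is hypothetical; the property and config names follow the auto-optimize documentation linked below, so verify them against your Databricks Runtime version.

```python
# Option 1: enable optimized writes and auto compaction per table
# via Delta table properties ("events" is a hypothetical table name).
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Option 2: enable both features for the whole session via Spark configs.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# The 50-small-files threshold that triggers auto compaction is configurable:
spark.conf.set("spark.databricks.delta.autoCompact.minNumFiles", "100")
```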
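And a minimal sketch of the regular OPTIMIZE / Z-ORDER path. Again, the table and column names (events, event_date, user_id) are hypothetical, and event_date is assumed to be a partition column.

```python
# Bin-pack the whole table into files on the order of 1 GB.
spark.sql("OPTIMIZE events")

# Bin-pack and co-locate related data by Z-ORDERing on common filter columns.
spark.sql("OPTIMIZE events ZORDER BY (event_date, user_id)")

# Limit the work to recent partitions (event_date assumed to be a partition column).
spark.sql("OPTIMIZE events WHERE event_date >= '2021-01-01'")
```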

Read more here:

https://docs.databricks.com/delta/optimizations/auto-optimize.html

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html

This doesn't help until after you've already loaded the original file set, correct? The one-time initial load of the many small files will still be slow, correct?
