10-06-2021 06:13 PM
How do you go about fixing the small/big file problem? What do you suggest?
10-07-2021 09:54 AM
Hi @William Scardua,
I recommend using Delta to avoid small/big file issues. For example, Auto Optimize is an optional set of features that automatically compacts small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. For more details and examples, please check the Auto Optimize documentation.
Auto Optimize will create files of about 128 MB each. If you would like to compact and optimize further, I recommend running the OPTIMIZE command on your Delta tables. By default, it compacts the data into files of up to 1 GB in size. For more details on this feature, please check the OPTIMIZE documentation.
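As a rough illustration of the two options above, here is a minimal sketch that builds the SQL statements involved. The helper names and the table name `events` are made up for this example; the two `delta.autoOptimize.*` table properties and the `OPTIMIZE` command are the Databricks features being discussed, but check the documentation for your runtime version before relying on them.

```python
def delta_compaction_statements(table_name, enable_auto_optimize=True, run_optimize=True):
    """Build SQL statements for compacting a Delta table.

    table_name is a placeholder; pass the name of your own Delta table.
    """
    statements = []
    if enable_auto_optimize:
        # Turn on Auto Optimize (optimized writes + auto compaction) for future writes.
        statements.append(
            f"ALTER TABLE {table_name} SET TBLPROPERTIES ("
            "delta.autoOptimize.optimizeWrite = true, "
            "delta.autoOptimize.autoCompact = true)"
        )
    if run_optimize:
        # Compact the files that already exist in the table.
        statements.append(f"OPTIMIZE {table_name}")
    return statements


def apply_statements(spark, statements):
    """Run each statement through an existing SparkSession (assumed to be `spark`)."""
    for sql in statements:
        spark.sql(sql)
```

In a notebook this could be used as `apply_statements(spark, delta_compaction_statements("events"))`. Running OPTIMIZE on a schedule rather than after every write is usually the cheaper choice, since it rewrites files.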
Thank you.
10-11-2021 02:14 PM
Okay @Jose Gonzalez, I understand. Thank you, man.
10-08-2021 12:01 AM
What Jose said.
If you cannot use Delta or do not want to:
using coalesce and repartition/partitioning is the way to control the file size.
There is no one ideal file size. It all depends on the use case, available cluster size, data flow downstream etc.
What you do want to avoid is a lot of small files (think only a few megabytes or kilobytes).
But there is nothing wrong with a single file of 2 MB.
That being said: Delta Lake makes this exercise way easier.
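For the non-Delta route, a minimal sketch of picking a partition count from a target file size could look like the following. The function names, the 128 MB default (borrowed from the Auto Optimize figure earlier in the thread), and the `total_bytes` estimate are all assumptions for illustration; Spark writes roughly one file per partition, and the size on disk will differ from the estimate because of compression and encoding.

```python
import math

DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # assumed target, not a universal ideal


def partitions_for_target_size(total_bytes, target_file_bytes=DEFAULT_TARGET_BYTES):
    """Return how many partitions give files of roughly target_file_bytes each."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_with_target_file_size(df, path, total_bytes, target_file_bytes=DEFAULT_TARGET_BYTES):
    """Resize a Spark DataFrame's partitions, then write it as Parquet.

    total_bytes is the caller's own estimate of the output size.
    """
    wanted = partitions_for_target_size(total_bytes, target_file_bytes)
    current = df.rdd.getNumPartitions()
    # coalesce avoids a full shuffle when reducing partitions;
    # repartition shuffles, which is needed to increase them.
    resized = df.coalesce(wanted) if wanted < current else df.repartition(wanted)
    resized.write.mode("overwrite").parquet(path)
    return wanted
```

For example, an estimated 10 GB of output with the default target gives 80 partitions. Because coalesce only merges existing partitions, the resulting files can be uneven; repartition gives more even sizes at the cost of a shuffle.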
10-11-2021 02:16 PM
Thank you for the feedback @Werner Stinckens, that's a good point.