Data Engineering
Small/big file problem, how do you fix it?

William_Scardua
Valued Contributor

How do you go about fixing the small/big file problem? What do you suggest?

1 ACCEPTED SOLUTION

Accepted Solutions

jose_gonzalez
Moderator
Moderator

Hi @William Scardua,

I recommend using Delta to avoid small/big file issues. For example, Auto Optimize is an optional set of features that automatically compacts small files during individual writes to a Delta table. Paying a small cost during writes offers significant benefits for tables that are queried actively. For more details and examples, please check the following link

Auto Optimize will create files of about 128 MB each. If you would like to compact further, I recommend running the OPTIMIZE command on your Delta tables. It compacts files to roughly 1 GB in size by default. For more details on this optimize feature, please check the following link
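As a minimal sketch of the two settings mentioned above (the table name `my_table` is illustrative, not from the original post), Auto Optimize is enabled via table properties and OPTIMIZE is run as a separate command:

```sql
-- Enable Auto Optimize on an existing Delta table (name is hypothetical).
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',  -- write fewer, larger files
  'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
);

-- Compact existing files further (targets ~1 GB files by default).
OPTIMIZE my_table;
```

Auto Optimize pays a small cost on each write, while OPTIMIZE is a heavier batch compaction you can schedule periodically.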

Thank you.

View solution in original post

5 REPLIES 5

Kaniz
Community Manager
Community Manager

Hi @William Scardua! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.


Okay @Jose Gonzalez, I understand. Thank you!

-werners-
Esteemed Contributor III

What Jose said.

If you cannot use Delta, or do not want to:

using coalesce and repartition/partitioning is the way to control file size.

There is no one ideal file size. It all depends on the use case, available cluster size, data flow downstream etc.

What you do want to avoid is a lot of small files (think only a few megabytes or kilobytes).

But there is nothing wrong with a single file of 2 MB.

That being said: delta lake makes this exercise way easier.
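The sizing logic above can be sketched as plain arithmetic: pick a target file size, divide the dataset size by it, and pass the result to repartition/coalesce. The function name and the 128 MB default below are illustrative choices, not from the thread:

```python
def target_partitions(dataset_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Return a partition count that yields files at or below the target size.

    Hypothetical helper: ceiling-divide the dataset size by the desired
    file size, with a floor of one partition.
    """
    return max(1, -(-dataset_bytes // target_file_bytes))


# A 10 GB dataset with 128 MB target files needs 80 partitions:
n = target_partitions(10 * 1024**3)
# You would then feed this into Spark before writing, e.g.:
#   df.repartition(n).write.parquet(path)
```

As werners notes, there is no single ideal file size; the target here is just a knob you tune to the cluster and downstream readers.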

Thank you for the feedback @Werner Stinckens, that's a good point.
