cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Recommendations to execute OPTIMIZE on tables

alejandrofm
Valued Contributor

Hi, have Databricks running on AWS, I'm looking for a way to know when is a good time to run optimize on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it and on which tables? Maybe I can create a script that executes once a week and run optimize command only when is needed.

Also, it's a recommended practice to run OPTIMIZE for each partition on the same job that I create the partition?

So I create 2022-02-12 partition then run optimize where partition=2022-02-12, I'm not sure about this because of the "eventual consistency" of S3, maybe the files are not ready as soon as they are loaded and something can become corrupted.

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

jose_gonzalez
Moderator
Moderator

Hi @Alejandro Martinez​ ,

How do you partition your Delta table? we dont recommend to create too many partitions on the same table.

If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.

If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize

View solution in original post

3 REPLIES 3

jose_gonzalez
Moderator
Moderator

Hi @Alejandro Martinez​ ,

How do you partition your Delta table? we dont recommend to create too many partitions on the same table.

If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.

If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize

Hi, we partition our tables by day, and a very little portion by date/hour when the hour filter is used a lot.

Thanks!

Anonymous
Not applicable

@Alejandro Martinez​ - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.