cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Recommendations to execute OPTIMIZE on tables

alejandrofm
Valued Contributor

Hi, have Databricks running on AWS, I'm looking for a way to know when is a good time to run optimize on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it and on which tables? Maybe I can create a script that executes once a week and run optimize command only when is needed.

Also, it's a recommended practice to run OPTIMIZE for each partition on the same job that I create the partition?

So I create 2022-02-12 partition then run optimize where partition=2022-02-12, I'm not sure about this because of the "eventual consistency" of S3, maybe the files are not ready as soon as they are loaded and something can become corrupted.

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

jose_gonzalez
Moderator
Moderator

Hi @Alejandro Martinez​ ,

How do you partition your Delta table? we dont recommend to create too many partitions on the same table.

If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.

If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize

View solution in original post

3 REPLIES 3

jose_gonzalez
Moderator
Moderator

Hi @Alejandro Martinez​ ,

How do you partition your Delta table? we dont recommend to create too many partitions on the same table.

If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.

If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize

Hi, we partition our tables by day, and a very little portion by date/hour when the hour filter is used a lot.

Thanks!

Anonymous
Not applicable

@Alejandro Martinez​ - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!