cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Recommendations to execute OPTIMIZE on tables

alejandrofm
Valued Contributor

Hi, have Databricks running on AWS, I'm looking for a way to know when is a good time to run optimize on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it and on which tables? Maybe I can create a script that executes once a week and run optimize command only when is needed.

Also, it's a recommended practice to run OPTIMIZE for each partition on the same job that I create the partition?

So I create 2022-02-12 partition then run optimize where partition=2022-02-12, I'm not sure about this because of the "eventual consistency" of S3, maybe the files are not ready as soon as they are loaded and something can become corrupted.

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

jose_gonzalez
Databricks Employee
Databricks Employee

Hi @Alejandro Martinez​ ,

How do you partition your Delta table? we dont recommend to create too many partitions on the same table.

If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.

If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize

View solution in original post

3 REPLIES 3

jose_gonzalez
Databricks Employee
Databricks Employee

Hi @Alejandro Martinez​ ,

How do you partition your Delta table? we dont recommend to create too many partitions on the same table.

If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.

If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize

Hi, we partition our tables by day, and a very little portion by date/hour when the hour filter is used a lot.

Thanks!

Anonymous
Not applicable

@Alejandro Martinez​ - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group