02-12-2022 01:30 PM
Hi, have Databricks running on AWS, I'm looking for a way to know when is a good time to run optimize on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it and on which tables? Maybe I can create a script that executes once a week and run optimize command only when is needed.
Also, it's a recommended practice to run OPTIMIZE for each partition on the same job that I create the partition?
So I create 2022-02-12 partition then run optimize where partition=2022-02-12, I'm not sure about this because of the "eventual consistency" of S3, maybe the files are not ready as soon as they are loaded and something can become corrupted.
Thanks!
02-16-2022 11:30 AM
Hi @Alejandro Martinez ,
How do you partition your Delta table? we dont recommend to create too many partitions on the same table.
If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.
If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize
02-16-2022 11:30 AM
Hi @Alejandro Martinez ,
How do you partition your Delta table? we dont recommend to create too many partitions on the same table.
If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.
If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize
02-17-2022 07:28 AM
Hi, we partition our tables by day, and a very little portion by date/hour when the hour filter is used a lot.
Thanks!
02-17-2022 08:24 AM
@Alejandro Martinez - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.
Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections.
Click here to register and join today!
Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.