- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-12-2022 01:30 PM
Hi, have Databricks running on AWS, I'm looking for a way to know when is a good time to run optimize on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it and on which tables? Maybe I can create a script that executes once a week and run optimize command only when is needed.
Also, it's a recommended practice to run OPTIMIZE for each partition on the same job that I create the partition?
So I create 2022-02-12 partition then run optimize where partition=2022-02-12, I'm not sure about this because of the "eventual consistency" of S3, maybe the files are not ready as soon as they are loaded and something can become corrupted.
Thanks!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-16-2022 11:30 AM
Hi @Alejandro Martinezโ ,
How do you partition your Delta table? we dont recommend to create too many partitions on the same table.
If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.
If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-16-2022 11:30 AM
Hi @Alejandro Martinezโ ,
How do you partition your Delta table? we dont recommend to create too many partitions on the same table.
If you execute "%sql describe details <delta_table_name>" how many files do you see in the "numFiles" column? You dont want to have millions of small files in your Delta table, so Optimize will compact files within the same partition to reduce the number of files. You can check your Delta table every week and according to the number of files, you can decide if you want to compact or not.
If you would like to take a look at the docs, please refer to this link https://docs.databricks.com/delta/optimizations/file-mgmt.html#delta-optimize
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-17-2022 07:28 AM
Hi, we partition our tables by day, and a very little portion by date/hour when the hour filter is used a lot.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-17-2022 08:24 AM
@Alejandro Martinezโ - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.