06-08-2023 06:03 AM
We have a table containing records from the last 2-3 years. The table size is around 7.5 TB (67 billion rows).
Because historical records are updated periodically and the table is optimized daily, we have repeatedly tried to run a manual VACUUM operation on it.
We have tried the following with no success:
During our analysis of the corresponding filesystem, we found that more than 1 TB of the data was in the _delta_log directory, where the table versions are tracked.
The questions we have are:
Kind Regards
06-09-2023 06:55 AM
It is difficult to say why the job kept running without looking at the logs. We do, however, recommend not running other jobs against the table concurrently, because they compete for the same bandwidth and can slow the vacuum down.
You can refer to this KB document for best practices related to the VACUUM command:
https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake
06-08-2023 07:05 AM
Hi @EDDatabricks, let me try to answer all the questions:
Here are some additional insights based on the information provided:
The table data is 7.5 TB across 500K files, which means the average file size is only about 15 MB. That is very small. My understanding is that the table's partitioning strategy is producing very small partitions, which is not efficient.
It is good practice to run the OPTIMIZE and VACUUM commands regularly. From the information provided, however, it looks like VACUUM has not been run for a long time. Because there is a large backlog for VACUUM and OPTIMIZE to work through, the first runs will take longer. Once the backlog is cleared, the run time should come down.
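As a rough check on the figures above, the average file size is simply the total size divided by the file count. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes) and the approximate numbers quoted in this thread; the function name is illustrative only.

```python
def average_file_size_mb(total_tb: float, file_count: int) -> float:
    """Return the average file size in MB, using decimal units (assumption)."""
    if file_count <= 0:
        raise ValueError("file_count must be positive")
    total_bytes = total_tb * 10**12      # 1 TB = 10^12 bytes
    return total_bytes / file_count / 10**6  # 1 MB = 10^6 bytes


# Figures quoted in this thread: 7.5 TB spread over roughly 500K files.
print(average_file_size_mb(7.5, 500_000))  # 15.0
```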
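A regular maintenance pass could look roughly like the sketch below. It only builds the SQL statements, so it can be read and checked without a cluster. The table name `events`, the helper name, and the choice of a dry run first are assumptions for illustration; 168 hours corresponds to the 7-day default retention period for Delta tables.

```python
from typing import List


def maintenance_statements(table: str, retain_hours: int = 168,
                           dry_run: bool = True) -> List[str]:
    """Build OPTIMIZE and VACUUM statements for a Delta table (hypothetical helper)."""
    if retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    if dry_run:
        # DRY RUN only lists the files that would be deleted; nothing is removed.
        vacuum += " DRY RUN"
    # Compact small files first, then clean up files that are no longer referenced.
    return [f"OPTIMIZE {table}", vacuum]


# 'events' is a placeholder table name, not the table from this thread.
for stmt in maintenance_statements("events"):
    print(stmt)
```

In a Databricks notebook each statement could then be passed to `spark.sql(stmt)`, ideally while no other jobs are writing to the table.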