Unable to perform VACUUM on Delta table

EDDatabricks
Contributor

We have a table containing records from the last 2-3 years. The table size is around 7.5 TB (67 billion rows).

Because there are periodic updates to historical records and daily optimizations of this table, we have repeatedly tried to execute a manual VACUUM operation on it.

We have tried the following with no success:

  • Small VACUUMs at 2-3 day intervals
    • the number of files was reduced
    • storage size remained similar
    • no successful completion or significant storage reduction
  • OPTIMIZE followed by VACUUM (see the sketch after this list)
    • whole-table OPTIMIZE first
    • VACUUM on the optimized table
    • no successful completion or significant storage reduction
  • VACUUM running for several days/weeks
    • runtime of multiple days (> 3 weeks in total)
    • the number of files was reduced (6M down to 500K files)
    • no actual storage reduction was observed (still > 7.5 TB)
    • no successful completion
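
For reference, a minimal sketch of the kind of maintenance commands we have been running (the table name and the retention window below are placeholders, not our exact production values):

    # Assumes a Databricks notebook, where `spark` is the pre-defined SparkSession.
    table_name = "prod_db.history_table"  # placeholder name

    # Compact small files into larger ones first.
    spark.sql(f"OPTIMIZE {table_name}")

    # Then remove data files no longer referenced by the table, keeping the
    # default 7-day (168-hour) retention window.
    spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")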

During our analysis of the underlying filesystem, we found that more than 1 TB of the data is in the _delta_log directory, where the table versions are tracked.

The questions we have are:

  1. Is a log of this size expected, or does it indicate an issue with the state of the table?
  2. Is there any way to estimate the time needed to complete the VACUUM operation? It is becoming increasingly costly to attempt such long-running operations.
  3. Why is there no significant reduction in storage? Previous discussions with the Databricks support team indicated that even unsuccessful runs should have an impact, and on our Azure storage accounts we see significant costs related to this table.
  4. MOST IMPORTANTLY: Could the absence of successful VACUUM operations affect the performance of queries on this table?
  5. Are there any best practices for this operation, or a more efficient approach (e.g. VACUUM on individual partitions of the table)?

Kind Regards

1 ACCEPTED SOLUTION

Lakshay
Esteemed Contributor

It is difficult to say why the job kept running without taking a look at the logs. However, we do recommend avoiding concurrent runs on the table, as they compete for the same bandwidth and can hamper the performance of the VACUUM.

You can refer to this KB article for best practices related to the VACUUM command:

https://kb.databricks.com/en_US/delta/vacuum-best-practices-on-delta-lake


3 REPLIES

Lakshay
Esteemed Contributor

Hi @EDDatabricks, let me try to answer all of the questions:

  1. The reported log size is real, not a display artifact. If the _delta_log directory is large, that is because of the number of Delta versions created and retained for the table. By default, the Delta log retention is 30 days. You can consider reducing this (see the sketch after this list), but it will limit your ability to time travel to previous versions.
  2. The time taken by VACUUM depends on the total number of folders and files present in the table directory.
  3. VACUUM deletes only stale files, i.e. files that are no longer referenced by the table. If there is no significant reduction in size, it means that most of the files are still referenced by the current version of the table.
  4. Without VACUUM, there are more files in the table directory. To run a SELECT query, the job has to parse through more files, and this can affect the performance of the query. You should run the OPTIMIZE and VACUUM commands frequently on the table.
  5. Run VACUUM frequently, and run OPTIMIZE before each VACUUM. OPTIMIZE compacts the data and reduces the number of files in the table by combining small files into larger ones; in the process it leaves behind stale files, which VACUUM can then delete (see the sketch after this list).
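
For points 1 and 5, a minimal sketch of what this looks like in practice (the table name is a placeholder, and the retention values shown are simply the Delta defaults):

    # Assumes a Databricks notebook, where `spark` is the pre-defined SparkSession.
    table_name = "prod_db.history_table"  # placeholder name

    # Point 1: retention is controlled by table properties. Lowering
    # delta.logRetentionDuration shrinks _delta_log over time, at the cost of a
    # shorter time-travel window. delta.deletedFileRetentionDuration controls how
    # old an unreferenced data file must be before VACUUM is allowed to delete it.
    spark.sql(f"""
        ALTER TABLE {table_name} SET TBLPROPERTIES (
            'delta.logRetentionDuration' = 'interval 30 days',
            'delta.deletedFileRetentionDuration' = 'interval 7 days'
        )
    """)

    # Point 5: compact first, then vacuum the stale files OPTIMIZE leaves behind.
    spark.sql(f"OPTIMIZE {table_name}")

    # DRY RUN lists (up to 1000 of) the files that would be deleted, which is a
    # cheap way to gauge how much work a full VACUUM still has ahead of it.
    spark.sql(f"VACUUM {table_name} DRY RUN").show(truncate=False)
    spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS")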

In addition, here are some insights based on the information you provided:

The table data is 7.5 TB spread across 500K files, which puts the average file size at roughly 15 MB, which is very small. My understanding is that the partitioning strategy on your table is generating very small partitions, which is not efficient.

It is good practice to run the OPTIMIZE and VACUUM commands regularly. From the information provided, it looks like VACUUM has not been run on this table in a long time. Because there is a large backlog of work for OPTIMIZE and VACUUM, the initial runs will take more time; once the backlog is cleared, the runtime will come down as well.
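
As a quick way to verify those numbers on your side, DESCRIBE DETAIL reports the current file count and total size of the table, from which the average file size follows (the table name is again a placeholder):

    # Assumes a Databricks notebook, where `spark` is the pre-defined SparkSession.
    detail = spark.sql("DESCRIBE DETAIL prod_db.history_table").collect()[0]

    num_files = detail["numFiles"]
    size_bytes = detail["sizeInBytes"]
    avg_mb = size_bytes / num_files / (1024 * 1024)
    print(f"{num_files} files, {size_bytes / 1024**4:.2f} TiB, ~{avg_mb:.0f} MiB per file")

    # 7.5 TB across 500K files works out to roughly 15 MB per file, far below
    # the ~1 GB file size that OPTIMIZE targets by default.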

Anonymous
Not applicable

Hi @EDDatabricks,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
