Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-08-2023 06:03 AM
We have a table containing records from the last 2-3 years. The table size is around 7.5 TBytes (67 Billion rows).
Because there are periodic updates on historical records and daily optimizations of this table, we have tried repeatedly to execute a manual VACUUM operation on aforementioned table.
We have tried the following with no success:
- SMALL VACUUMS on 2-3 days intervals
- number of files reduced
- similar memory
- no successful completion or significant memory reduction
- OPTIMIZATION & VACUUM
- whole table optimization firstly
- VACUUM on optimized table
- no successful completion or significant memory reduction
- VACUUM for several days/week
- runtime of multiple days (> 3 weeks)
- number of files reduced (6M to 500K files)
- no actual memory reduction was observed (still > 7.5 TBytes)
- no successful completion
During our analysis of the corresponding filesystem, > 1TByte of data was the _delta_log where versions of table are tracked.
The questions we have are:
- Whether the log size is logical or is there any issue with the state of the table
- If there is any way to estimate the actual time needed to perform the VACUUM operation (it is becoming increasingly costly to try such long operations)
- Why there is no significant reduction in memory (previous discussions with databricks support team indicated that even unsuccessful operations should have an impact/on azure storage accounts we see significant costs relating to this table)
- MOST IMPORTANTLY: Could the absence of successful VACUUM operations affect the performance of queries on this table?
- Are there any best practices related to this operation or any other more efficient way? (e.g. VACUUM on partitions of the table)
Kind Regards
Labels:
- Labels:
-
VACUUM Operation