Unable to perform VACUUM on Delta table

EDDatabricks
Databricks Partner

We have a table containing records from the last 2-3 years. The table size is around 7.5 TBytes (67 Billion rows).

Because there are periodic updates on historical records and daily optimizations of this table, we have tried repeatedly to execute a manual VACUUM operation on aforementioned table.

We have tried the following with no success:

  • SMALL VACUUMS on 2-3 days intervals
    • number of files reduced
    • similar memory
    • no successful completion or significant memory reduction
  • OPTIMIZATION & VACUUM
    • whole table optimization firstly
    • VACUUM on optimized table
    • no successful completion or significant memory reduction
  • VACUUM for several days/week
    • runtime of multiple days (> 3 weeks)
    • number of files reduced (6M to 500K files)
    • no actual memory reduction was observed (still > 7.5 TBytes)
    • no successful completion

During our analysis of the corresponding filesystem, > 1TByte of data was the _delta_log where versions of table are tracked.

The questions we have are:

  1. Whether the log size is logical or is there any issue with the state of the table
  2. If there is any way to estimate the actual time needed to perform the VACUUM operation (it is becoming increasingly costly to try such long operations)
  3. Why there is no significant reduction in memory (previous discussions with databricks support team indicated that even unsuccessful operations should have an impact/on azure storage accounts we see significant costs relating to this table)
  4. MOST IMPORTANTLY: Could the absence of successful VACUUM operations affect the performance of queries on this table?
  5. Are there any best practices related to this operation or any other more efficient way? (e.g. VACUUM on partitions of the table)

Kind Regards