Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

VACUUM with Azure Storage Inventory Report is not working

YuriS
New Contributor II

Could someone please advise regarding VACUUM with an Azure Storage Inventory Report, as I have failed to make it work.

On DBR 15.4 LTS, the VACUUM command is run with the USING INVENTORY clause, as follows:

VACUUM schema.table USING INVENTORY (
  select 'https://xxx.blob.core.windows.net/' || ir.Name as path,
         ir.`Content-Length` as length,
         case when ir.hdi_isfolder is null then false else ir.hdi_isfolder end as isDir,
         ir.`Last-Modified` as modificationTime
    from inventory_raw ir
   where ...
)
 
It does not fail; however, it does not vacuum anything.
The DESCRIBE HISTORY output is as follows:

VACUUM END {"numDeletedFiles":"0","numVacuumedDirectories":"1"}
VACUUM START {"numFilesToDelete":"0","sizeOfDataToDelete":"0"}
 
At the same time, VACUUM without the INVENTORY clause but with the DRY RUN option shows ~1k files to be vacuumed.
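For reference, that dry-run baseline would be a plain VACUUM along these lines (a minimal sketch; the 168-hour retention shown is the 7-day default and is assumed here, not quoted from the post):

-- Lists the files that would be deleted, without removing anything
VACUUM schema.table RETAIN 168 HOURS DRY RUN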
Can someone also advise whether that USING INVENTORY clause really works on Databricks' version of Delta? I failed to find any information in the official Databricks docs, only here: https://delta.io/blog/efficient-delta-vacuum/
 
Thank you

 

2 REPLIES

Brahmareddy
Honored Contributor III

Hi YuriS,

How are you doing today? As I understand it, you're right to look into the USING INVENTORY clause for VACUUM, especially when dealing with large storage footprints. The tricky part is that while this feature is part of open-source Delta Lake, it is not yet fully supported or documented in Databricks' managed Delta implementation, which would explain why you're seeing unexpected results and not finding anything in the official Databricks docs.

In your case, the VACUUM command runs but doesn't delete anything because, as far as I can tell, Databricks isn't acting on the external inventory metadata, even though it parses the syntax without error. That's why your dry-run vacuum (without inventory) shows 1K files ready to be cleaned while the inventory-based vacuum does nothing: the external inventory report isn't being used in a meaningful way within Databricks at this time.

So for now, I'd suggest sticking with the standard VACUUM approach in Databricks, possibly running DRY RUN regularly to monitor what would be removed. You could also automate this with a custom retention window to stay efficient. Hopefully Databricks adds support for inventory-based vacuuming soon, since it's great for large cloud storage environments, but as of now it doesn't appear to be officially supported on the managed platform. Let me know if you'd like help setting up a more efficient vacuum strategy based on what Databricks does support today!
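For example, a standard cleanup routine could look roughly like this (a sketch only; the 240-hour retention window is illustrative and should be set to your own recovery requirements):

-- 1. Inspect what would be removed; DRY RUN deletes nothing
VACUUM schema.table RETAIN 240 HOURS DRY RUN;

-- 2. Once the list looks right, run the actual cleanup with the same window
VACUUM schema.table RETAIN 240 HOURS;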

Regards,

Brahma

YuriS
New Contributor II

After additional investigation, it turned out the proper "fully-qualified-URL" path should be the DBFS mount path, i.e.

'dbfs:/mnt/{endpoint}/' || ir.Name as path,

and not the https URL:

'https://xxx.blob.core.windows.net/' || ir.Name as path,
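Putting the correction together, the working command looks roughly like this (a sketch based on the fix above; {endpoint} stands for the actual mount name, and coalesce is just shorthand for the original CASE expression):

VACUUM schema.table USING INVENTORY (
  select 'dbfs:/mnt/{endpoint}/' || ir.Name as path,  -- DBFS mount path, not the https URL
         ir.`Content-Length` as length,
         coalesce(ir.hdi_isfolder, false) as isDir,   -- equivalent to the original CASE expression
         ir.`Last-Modified` as modificationTime
    from inventory_raw ir
   where ...  -- same filters as in the original query
)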
