Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

VACUUM with Azure Storage Inventory Report is not working

YuriS
New Contributor II

Could someone please advise regarding VACUUM with an Azure Storage Inventory Report, as I have failed to make it work.

On DBR 15.4 LTS, the VACUUM command is run with the USING INVENTORY clause, as follows:

VACUUM schema.table USING INVENTORY (
  select 'https://xxx.blob.core.windows.net/' || ir.Name as path,
         ir.`Content-Length` as length,
         case when ir.hdi_isfolder is null then false else ir.hdi_isfolder end as isDir,
         ir.`Last-Modified` as modificationTime
    from inventory_raw ir
   where ...
)
 
It does not fail; however, it does not vacuum anything.
The DESCRIBE HISTORY output is as follows:

VACUUM END {"numDeletedFiles":"0","numVacuumedDirectories":"1"}
VACUUM START {"numFilesToDelete":"0","sizeOfDataToDelete":"0"}
 
At the same time, VACUUM without the INVENTORY clause but with the DRY RUN option shows ~1k files to be vacuumed.
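For reference, that dry-run baseline would be a plain VACUUM along these lines (a minimal sketch; the 168-hour retention shown is the 7-day default and is assumed here, not quoted from the post):

-- Lists the files that would be deleted, without removing anything
VACUUM schema.table RETAIN 168 HOURS DRY RUN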
Can someone also advise whether that USING INVENTORY clause really works on Databricks' version of Delta? I failed to find any information in the official Databricks docs, only here: https://delta.io/blog/efficient-delta-vacuum/
 
Thank you

 

2 REPLIES

Brahmareddy
Honored Contributor III

Hi YuriS,

How are you doing today? As I understand it, you're right to look into the USING INVENTORY clause for VACUUM, especially when dealing with large storage footprints. The tricky part is that while this feature is part of open-source Delta Lake, it is not yet fully supported or documented in Databricks' managed Delta implementation, which would explain why you're seeing unexpected results and not finding anything in the official Databricks docs.

In your case, the VACUUM command runs but doesn't delete anything because, as far as I can tell, Databricks isn't acting on the external inventory metadata, even though it parses the syntax without error. That's why your dry-run vacuum (without inventory) shows 1K files ready to be cleaned while the inventory-based vacuum does nothing: the external inventory report isn't being used in a meaningful way within Databricks at this time.

So for now, I'd suggest sticking with the standard VACUUM approach in Databricks, possibly running DRY RUN regularly to monitor what would be removed. You could also automate this with a custom retention window to stay efficient. Hopefully Databricks adds support for inventory-based vacuuming soon, since it's great for large cloud storage environments, but as of now it doesn't appear to be officially supported on the managed platform. Let me know if you'd like help setting up a more efficient vacuum strategy based on what Databricks does support today!
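For example, a standard cleanup routine could look roughly like this (a sketch only; the 240-hour retention window is illustrative and should be set to your own recovery requirements):

-- 1. Inspect what would be removed; DRY RUN deletes nothing
VACUUM schema.table RETAIN 240 HOURS DRY RUN;

-- 2. Once the list looks right, run the actual cleanup with the same window
VACUUM schema.table RETAIN 240 HOURS;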

Regards,

Brahma

YuriS
New Contributor II

After additional investigation, it turned out the proper "fully-qualified-URL" path should be the DBFS mount path, i.e.

'dbfs:/mnt/{endpoint}/' || ir.Name as path,

and not the https URL:

'https://xxx.blob.core.windows.net/' || ir.Name as path,
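Putting the correction together, the working command looks roughly like this (a sketch based on the fix above; {endpoint} stands for the actual mount name, and coalesce is just shorthand for the original CASE expression):

VACUUM schema.table USING INVENTORY (
  select 'dbfs:/mnt/{endpoint}/' || ir.Name as path,  -- DBFS mount path, not the https URL
         ir.`Content-Length` as length,
         coalesce(ir.hdi_isfolder, false) as isDir,   -- equivalent to the original CASE expression
         ir.`Last-Modified` as modificationTime
    from inventory_raw ir
   where ...  -- same filters as in the original query
)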
