
Does VACUUM on Delta Lake also clean Iceberg metadata when using Iceberg Uniform feature?

eyalholzmann
New Contributor

I'm working with Delta tables using the Iceberg Uniform feature to enable Iceberg-compatible reads. I'm trying to understand how metadata cleanup works in this setup.

Specifically, does the VACUUM operation, which removes old Delta Lake metadata based on the retention period, also trigger deletion of the corresponding Iceberg metadata? Or is Iceberg metadata managed separately, requiring its own cleanup process?


Louis_Frolio
Databricks Employee

Great question, @eyalholzmann,

In Databricks Delta Lake with the Iceberg Uniform feature, VACUUM operations on the Delta table do NOT automatically clean up the corresponding Iceberg metadata. The two metadata layers are managed separately, and understanding this distinction is critical to avoid potential data corruption and query failures.

How Metadata Cleanup Works

Delta Lake VACUUM Behavior

When you run VACUUM on a Delta table with Iceberg Uniform enabled, the operation removes Parquet data files that are no longer referenced by Delta Lake metadata based on the retention period you specify. This standard Delta Lake cleanup process only considers the Delta transaction log when determining which files to remove.
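As a minimal illustration (the table name is hypothetical), a manual VACUUM with an explicit retention window looks like this; DRY RUN previews which files would be removed:

    -- preview the unreferenced data files VACUUM would delete
    VACUUM main.sales.orders DRY RUN;
    -- delete data files no longer referenced by the Delta log, keeping the last 7 days
    VACUUM main.sales.orders RETAIN 168 HOURS;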

Iceberg Metadata Management

The Iceberg metadata generated by UniForm is stored separately in the table directory under the `/metadata/` subdirectory as versioned JSON files following the pattern `<table-path>/metadata/<version-number>-<uuid>.metadata.json`. These metadata files track their own snapshots and manifest files independently from Delta's transaction log.
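For context, UniForm is enabled through Delta table properties (property names as documented for recent Databricks runtimes; the table name is illustrative, so verify against your environment), after which the Iceberg metadata files appear alongside the Delta log:

    -- enable Iceberg-compatible reads (UniForm) on an existing Delta table
    ALTER TABLE main.sales.orders SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    );
    -- resulting layout (illustrative):
    --   <table-path>/_delta_log/...                                   Delta transaction log
    --   <table-path>/metadata/<version-number>-<uuid>.metadata.json   Iceberg metadata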

Critical Risk: Metadata Synchronization

A significant operational concern exists when using path-based Iceberg clients: users may encounter errors when querying Iceberg tables using out-of-date metadata versions after VACUUM removes Parquet data files from the Delta table. This happens because:

- The Iceberg metadata files may still reference data files that VACUUM has removed
- Path-based Iceberg clients require manual updating and refreshing of metadata JSON paths to read current table versions
- There's no automatic cleanup mechanism that removes stale Iceberg metadata when corresponding data files are vacuumed

Recommended Approach

To manage this setup effectively:

1. Enable Predictive Optimization: Databricks recommends enabling predictive optimization for Unity Catalog managed tables, which automatically handles VACUUM operations and maintenance tasks (a command sketch for these steps follows the list)

2. Monitor Metadata Status: Use `DESCRIBE EXTENDED table_name` to check the `converted_delta_version` and `converted_delta_timestamp` fields to verify which Delta version corresponds to the current Iceberg metadata

3. Manual Metadata Refresh: If metadata becomes stale, use `MSCK REPAIR TABLE <table-name> SYNC METADATA` to manually trigger Iceberg metadata regeneration

4. Coordinate Retention Periods: Ensure your VACUUM retention period is long enough to account for any lag in Iceberg metadata updates and client access patterns
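A minimal sketch of the commands behind these steps, assuming a hypothetical table main.sales.orders and catalog-level predictive optimization:

    -- 1. let Databricks schedule VACUUM and other maintenance automatically
    ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
    -- 2. check which Delta version the current Iceberg metadata was generated from
    DESCRIBE EXTENDED main.sales.orders;  -- look for converted_delta_version / converted_delta_timestamp
    -- 3. force regeneration of Iceberg metadata if it has fallen behind
    MSCK REPAIR TABLE main.sales.orders SYNC METADATA;
    -- 4. keep the VACUUM retention window generous enough for Iceberg readers
    VACUUM main.sales.orders RETAIN 336 HOURS;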

The key takeaway is that Iceberg metadata cleanup is not automatic when running VACUUM, and you must carefully manage metadata synchronization to prevent Iceberg clients from attempting to read files that have been removed by Delta's cleanup processes.

Hope this helps, Louis.

eyalholzmann
New Contributor

Which actions should be used to clean up and maintain Iceberg metadata?

  • expireSnapshots: Is it recommended to delete old snapshots using the same retention period as the Delta table?

  • deleteOrphanFiles: This deletes unreferenced Iceberg metadata as well as unreferenced data files. Is it safe to run this when some data might still be referenced by Delta metadata?

  • rewriteManifests: This action rewrites manifest files for optimization but also creates a new snapshot. Should this be executed?

Louis_Frolio
Databricks Employee

Here's how to approach cleaning and maintaining Apache Iceberg metadata on Databricks, and how it differs from Delta workflows.

First, know your table type

  • For Unity Catalog-managed Iceberg tables, Databricks runs table maintenance for you (predictive optimization), including snapshot expiration and orphan-file cleanup, so you rarely need to run these actions manually.

  • For foreign/external Iceberg tables (or if you intentionally disable automation), you may choose to run specific Iceberg maintenance procedures yourself.


Action-by-action guidance

expireSnapshots

  • Yes: expireSnapshots is recommended to bound your time-travel/rollback window and keep metadata compact. On managed Iceberg, UC automates snapshot expiration; choose manual retention only when you need tighter control.

  • Don't assume the same retention as your Delta VACUUM. Set Iceberg's retention to match your operational needs (time travel, audit requirements, longest-running jobs), independent of Delta's retention checks. If you do run it manually, you can use Iceberg procedures, for example:

    SQL (Iceberg procedure):
    CALL <catalog>.system.expire_snapshots(table => 'db.tbl', older_than => CURRENT_TIMESTAMP - INTERVAL 7 DAYS);

    or (client-dependent syntax):
    ALTER TABLE db.tbl EXECUTE expire_snapshots(retention_threshold => '7d');

deleteOrphanFiles

  • Only run deleteOrphanFiles when the table's storage location is used exclusively by Iceberg and you're certain those files aren't referenced elsewhere. If the same Parquet files serve multiple formats (e.g., Delta with Iceberg reads/UniForm), deleting "orphans" from Iceberg's perspective can break Delta readers that still reference them. In short: not safe if Delta still references those files.

    Why: Databricks supports workflows where a single copy of Parquet data is served to multiple formats; removing files because they're "unreferenced" in Iceberg can invalidate concurrent readers in Delta or path-based Iceberg clients until metadata is refreshed.
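    If you do run it against a location used only by Iceberg, a minimal sketch with the standard Iceberg Spark procedure (catalog and table names illustrative; do not point this at a UniForm table's location):

    -- delete files under the table location that no Iceberg snapshot references,
    -- keeping a safety margin for in-flight writers
    CALL <catalog>.system.remove_orphan_files(
      table => 'db.tbl',
      older_than => CURRENT_TIMESTAMP - INTERVAL 3 DAYS
    );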

     

rewriteManifests

  • rewriteManifests is safe and often beneficial: it rewrites manifest files for planning efficiency and creates a new snapshot (data remains unchanged). On managed Iceberg, UC periodically optimizes metadata for you; consider manual rewrites for external tables or after heavy streaming/append workloads that produce many small manifests.

  • Practical tips (when you run it yourself): target specific large or fragmented manifests instead of rewriting all; avoid Spark executor memory pressure by disabling aggressive caching during the operation (client-dependent).
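    When you do run it yourself, a minimal sketch using the Iceberg Spark procedure (catalog and table names illustrative):

    -- consolidate small or fragmented manifests; data files are untouched, a new snapshot is created
    CALL <catalog>.system.rewrite_manifests(table => 'db.tbl');
    -- optionally turn off manifest caching during the rewrite to reduce executor memory pressure
    CALL <catalog>.system.rewrite_manifests(table => 'db.tbl', use_caching => false);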


Summary recommendations

  • On managed Iceberg: rely on UC's automated maintenance; override manually only for special cases or compliance windows.

  • On external/foreign Iceberg:

    • Use expireSnapshots regularly (based on business SLAs),
    • Avoid deleteOrphanFiles if any other table/format could still reference the same files (including Delta),
    • Run rewriteManifests periodically to keep planning efficient, especially for streaming/high-churn tables.
       

Cheers, Louis.
