
Does VACUUM on Delta Lake also clean Iceberg metadata when using Iceberg Uniform feature?

eyalholzmann
New Contributor

I'm working with Delta tables using the Iceberg Uniform feature to enable Iceberg-compatible reads. I'm trying to understand how metadata cleanup works in this setup.

Specifically, does the VACUUM operation, which removes old Delta Lake metadata based on the retention period, also trigger deletion of the corresponding Iceberg metadata? Or is Iceberg metadata managed separately, requiring its own cleanup process?


Louis_Frolio
Databricks Employee

Great question, @eyalholzmann,

In Databricks Delta Lake with the Iceberg Uniform feature, VACUUM operations on the Delta table do NOT automatically clean up the corresponding Iceberg metadata. The two metadata layers are managed separately, and understanding this distinction is critical to avoid potential data corruption and query failures.

How Metadata Cleanup Works

Delta Lake VACUUM Behavior

When you run VACUUM on a Delta table with Iceberg Uniform enabled, the operation removes Parquet data files that are no longer referenced by Delta Lake metadata based on the retention period you specify. This standard Delta Lake cleanup process only considers the Delta transaction log when determining which files to remove.
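As a minimal illustration (the table name is hypothetical), a manual VACUUM with an explicit retention window looks like this; DRY RUN previews which files would be removed:

    -- preview the unreferenced data files VACUUM would delete
    VACUUM main.sales.orders DRY RUN;
    -- delete data files no longer referenced by the Delta log, keeping the last 7 days
    VACUUM main.sales.orders RETAIN 168 HOURS;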

Iceberg Metadata Management

The Iceberg metadata generated by UniForm is stored separately in the table directory under the `/metadata/` subdirectory as versioned JSON files following the pattern `<table-path>/metadata/<version-number>-<uuid>.metadata.json`. These metadata files track their own snapshots and manifest files independently from Delta's transaction log.
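For context, UniForm is enabled through Delta table properties (property names as documented for recent Databricks runtimes; the table name is illustrative, so verify against your environment), after which the Iceberg metadata files appear alongside the Delta log:

    -- enable Iceberg-compatible reads (UniForm) on an existing Delta table
    ALTER TABLE main.sales.orders SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    );
    -- resulting layout (illustrative):
    --   <table-path>/_delta_log/...                                   Delta transaction log
    --   <table-path>/metadata/<version-number>-<uuid>.metadata.json   Iceberg metadata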

Critical Risk: Metadata Synchronization

A significant operational concern exists when using path-based Iceberg clients: users may encounter errors when querying Iceberg tables using out-of-date metadata versions after VACUUM removes Parquet data files from the Delta table. This happens because:

- The Iceberg metadata files may still reference data files that VACUUM has removed
- Path-based Iceberg clients require manual updating and refreshing of metadata JSON paths to read current table versions
- There's no automatic cleanup mechanism that removes stale Iceberg metadata when corresponding data files are vacuumed

Recommended Approach

To manage this setup effectively:

1. Enable Predictive Optimization: Databricks recommends enabling predictive optimization for Unity Catalog managed tables, which automatically handles VACUUM operations and maintenance tasks (a command sketch for these steps follows the list)

2. Monitor Metadata Status: Use `DESCRIBE EXTENDED table_name` to check the `converted_delta_version` and `converted_delta_timestamp` fields to verify which Delta version corresponds to the current Iceberg metadata

3. Manual Metadata Refresh: If metadata becomes stale, use `MSCK REPAIR TABLE <table-name> SYNC METADATA` to manually trigger Iceberg metadata regeneration

4. Coordinate Retention Periods: Ensure your VACUUM retention period is long enough to account for any lag in Iceberg metadata updates and client access patterns
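A minimal sketch of the commands behind these steps, assuming a hypothetical table main.sales.orders and catalog-level predictive optimization:

    -- 1. let Databricks schedule VACUUM and other maintenance automatically
    ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
    -- 2. check which Delta version the current Iceberg metadata was generated from
    DESCRIBE EXTENDED main.sales.orders;  -- look for converted_delta_version / converted_delta_timestamp
    -- 3. force regeneration of Iceberg metadata if it has fallen behind
    MSCK REPAIR TABLE main.sales.orders SYNC METADATA;
    -- 4. keep the VACUUM retention window generous enough for Iceberg readers
    VACUUM main.sales.orders RETAIN 336 HOURS;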

The key takeaway is that Iceberg metadata cleanup is not automatic when running VACUUM, and you must carefully manage metadata synchronization to prevent Iceberg clients from attempting to read files that have been removed by Delta's cleanup processes.

Hope this helps, Louis.

eyalholzmann
New Contributor

Which actions should be used to clean up and maintain Iceberg metadata?

  • expireSnapshots: Is it recommended to delete old snapshots using the same retention period as the Delta table?

  • deleteOrphanFiles: This deletes unreferenced Iceberg metadata as well as unreferenced data files. Is it safe to run this when some data might still be referenced by Delta metadata?

  • rewriteManifests: This action rewrites manifest files for optimization but also creates a new snapshot. Should this be executed?

Louis_Frolio
Databricks Employee

Here's how to approach cleaning and maintaining Apache Iceberg metadata on Databricks, and how it differs from Delta workflows.

First, know your table type

  • For Unity Catalog-managed Iceberg tables, Databricks runs table maintenance for you (predictive optimization), including snapshot expiration and orphan-file cleanup, so you rarely need to run these actions manually.

  • For foreign/external Iceberg tables (or if you intentionally disable automation), you may choose to run specific Iceberg maintenance procedures yourself.


Action-by-action guidance

expireSnapshots

  • Yes: expireSnapshots is recommended to bound your time-travel/rollback window and keep metadata compact. On managed Iceberg, UC automates snapshot expiration; choose manual retention only when you need tighter control.

  • Don't assume the same retention as your Delta VACUUM. Set Iceberg's retention to match your operational needs (time travel, audit requirements, longest-running jobs), independent of Delta's retention checks. If you do run it manually, you can use Iceberg procedures, for example:

    SQL (Iceberg procedure):
    CALL <catalog>.system.expire_snapshots(table => 'db.tbl', older_than => CURRENT_TIMESTAMP - INTERVAL 7 DAYS);

    or (client-dependent syntax):
    ALTER TABLE db.tbl EXECUTE expire_snapshots(retention_threshold => '7d');

deleteOrphanFiles

  • Only run deleteOrphanFiles when the table's storage location is used exclusively by Iceberg and you're certain those files aren't referenced elsewhere. If the same Parquet files serve multiple formats (e.g., Delta with Iceberg reads/UniForm), deleting "orphans" from Iceberg's perspective can break Delta readers that still reference them. In short: not safe if Delta still references those files.

    Why: Databricks supports workflows where a single copy of Parquet data is served to multiple formats; removing files because they're "unreferenced" in Iceberg can invalidate concurrent readers in Delta or path-based Iceberg clients until metadata is refreshed.
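    If you do run it against a location used only by Iceberg, a minimal sketch with the standard Iceberg Spark procedure (catalog and table names illustrative; do not point this at a UniForm table's location):

    -- delete files under the table location that no Iceberg snapshot references,
    -- keeping a safety margin for in-flight writers
    CALL <catalog>.system.remove_orphan_files(
      table => 'db.tbl',
      older_than => CURRENT_TIMESTAMP - INTERVAL 3 DAYS
    );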

     

rewriteManifests

  • rewriteManifests is safe and often beneficial: it rewrites manifest files for planning efficiency and creates a new snapshot (data remains unchanged). On managed Iceberg, UC periodically optimizes metadata for you; consider manual rewrites for external tables or after heavy streaming/append workloads that produce many small manifests.

  • Practical tips (when you run it yourself): target specific large or fragmented manifests instead of rewriting all; avoid Spark executor memory pressure by disabling aggressive caching during the operation (client-dependent).
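    When you do run it yourself, a minimal sketch using the Iceberg Spark procedure (catalog and table names illustrative):

    -- consolidate small or fragmented manifests; data files are untouched, a new snapshot is created
    CALL <catalog>.system.rewrite_manifests(table => 'db.tbl');
    -- optionally turn off manifest caching during the rewrite to reduce executor memory pressure
    CALL <catalog>.system.rewrite_manifests(table => 'db.tbl', use_caching => false);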


Summary recommendations

  • On managed Iceberg: rely on UC's automated maintenance; override manually only for special cases or compliance windows.

  • On external/foreign Iceberg:

    • Use expireSnapshots regularly (based on business SLAs),
    • Avoid deleteOrphanFiles if any other table/format could still reference the same files (including Delta),
    • Run rewriteManifests periodically to keep planning efficient, especially for streaming/high-churn tables.
       

Cheers, Louis.
