<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic making REORG TABLE to enable Iceberg Uniform more efficient and faster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/making-reorg-table-to-enable-iceberg-uniform-more-efficient-and/m-p/125017#M47320</link>
<description>&lt;P&gt;I am upgrading a large number of tables for Iceberg / UniForm compatibility by running&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;REORG TABLE &amp;lt;tablename&amp;gt; APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2));&lt;/LI-CODE&gt;&lt;DIV class=""&gt;and finding that some tables take several hours to upgrade, presumably because all the Parquet files are being rewritten rather than just the Iceberg metadata being generated.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;Are there any configurations I can use to make this process more efficient? Or any guidelines for optimizing cluster parameters (number of nodes, instance size, etc.) to make these sorts of operations faster? I'm not keen on spending a few weeks getting Iceberg working on my tables.&lt;/DIV&gt;</description>
    <pubDate>Sat, 12 Jul 2025 18:48:54 GMT</pubDate>
    <dc:creator>JameDavi_51481</dc:creator>
    <dc:date>2025-07-12T18:48:54Z</dc:date>
    <item>
      <title>making REORG TABLE to enable Iceberg Uniform more efficient and faster</title>
      <link>https://community.databricks.com/t5/data-engineering/making-reorg-table-to-enable-iceberg-uniform-more-efficient-and/m-p/125017#M47320</link>
<description>&lt;P&gt;I am upgrading a large number of tables for Iceberg / UniForm compatibility by running&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;REORG TABLE &amp;lt;tablename&amp;gt; APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION=2));&lt;/LI-CODE&gt;&lt;DIV class=""&gt;and finding that some tables take several hours to upgrade, presumably because all the Parquet files are being rewritten rather than just the Iceberg metadata being generated.&lt;/DIV&gt;&lt;DIV class=""&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV class=""&gt;Are there any configurations I can use to make this process more efficient? Or any guidelines for optimizing cluster parameters (number of nodes, instance size, etc.) to make these sorts of operations faster? I'm not keen on spending a few weeks getting Iceberg working on my tables.&lt;/DIV&gt;</description>
      <pubDate>Sat, 12 Jul 2025 18:48:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/making-reorg-table-to-enable-iceberg-uniform-more-efficient-and/m-p/125017#M47320</guid>
      <dc:creator>JameDavi_51481</dc:creator>
      <dc:date>2025-07-12T18:48:54Z</dc:date>
    </item>
    <item>
      <title>Re: making REORG TABLE to enable Iceberg Uniform more efficient and faster</title>
      <link>https://community.databricks.com/t5/data-engineering/making-reorg-table-to-enable-iceberg-uniform-more-efficient-and/m-p/125020#M47322</link>
<description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/37890"&gt;@JameDavi_51481&lt;/a&gt;, have you tried this approach for enabling Iceberg metadata alongside the Delta format?&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;ALTER TABLE internal_poc_iceberg.iceberg_poc.clickstream_gold_sink_dlt
SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);&lt;/LI-CODE&gt;&lt;P&gt;Please let me know if that helps. If you have already used it and are still looking for a faster REORG with a complete rewrite, you can tune the cluster settings and configuration to speed it up.&lt;/P&gt;&lt;H4&gt;1. &lt;STRONG&gt;Use a cluster with high parallelism&lt;/STRONG&gt;&lt;/H4&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Use a &lt;STRONG&gt;larger cluster&lt;/STRONG&gt; (more worker nodes) with:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;High I/O throughput (EBS-optimized instances in AWS, or Premium SSD in Azure)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A high memory-to-core ratio (e.g., i3, r5d, or m5d instances in AWS)&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Try &lt;STRONG&gt;Photon-enabled&lt;/STRONG&gt; clusters if available; Photon often improves the performance of I/O-heavy workloads.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;H4&gt;2. &lt;STRONG&gt;Run upgrades in parallel&lt;/STRONG&gt;&lt;/H4&gt;&lt;P&gt;If you're upgrading multiple tables, &lt;STRONG&gt;batch them in parallel&lt;/STRONG&gt; using job clusters or workflows.&lt;/P&gt;&lt;H4&gt;3. &lt;STRONG&gt;Use autoscaling job clusters&lt;/STRONG&gt;&lt;/H4&gt;</description>
      <pubDate>Sat, 12 Jul 2025 19:08:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/making-reorg-table-to-enable-iceberg-uniform-more-efficient-and/m-p/125020#M47322</guid>
      <dc:creator>sridharplv</dc:creator>
      <dc:date>2025-07-12T19:08:10Z</dc:date>
    </item>
  </channel>
</rss>

