03-13-2026 06:29 AM - edited 03-13-2026 06:30 AM
We are using Databricks to connect to a Glue catalog that contains Iceberg tables. We are on DBR 17.2 and adding the jars
org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0
org.apache.iceberg:iceberg-aws-bundle:1.10.0
The Spark config is then set to:
spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.spark_catalog.catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.spark_catalog.io-impl: org.apache.iceberg.aws.s3.S3FileIO
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.databricks.hive.metastore.glueCatalog.enabled: true
This allows read/write access to the Iceberg tables using spark_catalog.<schema>.<table_name>.
This is largely working fine; however, we have recently run into a problem where an UPDATE statement causes duplicate rows to be created. After repeated executions this quickly becomes a huge problem, with the table growing exponentially in size. To be clear, this is a plain UPDATE <table> SET <fields> WHERE <filter>; it is NOT a MERGE statement.
We are in a workspace with Unity Catalog enabled (but we are not really using it). Looking at the snapshots metadata for the table in question, we can see all the operations have
iceberg-version=Apache Iceberg unspecified (commit 7dbafb438ee1e68d0047bebcb587265d7d87d8a1)
When I have tried this using the same jars on OSS Spark, the iceberg-version accurately reflects the version of the jars. I cannot get the row duplication to occur using OSS.
Not every UPDATE causes this problem; it only happens when an INSERT statement has run since the previous UPDATE. I'm not sure if this is something to do with Unity Catalog or where to look next.
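A minimal sketch of the pattern (the table, column, and values here are made-up placeholders, not our real schema):

INSERT INTO spark_catalog.my_schema.my_table
VALUES (1, 'pending'), (2, 'pending');

-- the first plain UPDATE after an INSERT is the one that duplicates rows
UPDATE spark_catalog.my_schema.my_table
SET status = 'processed'
WHERE status = 'pending';

-- the row count grows even though the UPDATE should rewrite rows in place
SELECT count(*) FROM spark_catalog.my_schema.my_table;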
03-15-2026 01:04 AM
Interesting observation about the INSERT before UPDATE pattern; that's actually a useful clue for narrowing this down.
The iceberg-version=unspecified is worth investigating separately - maybe the runtime is ignoring your jars for some reason, but that is only a hypothesis. On the duplicates though: if this were purely a format-level issue with position deletes not being supported, you'd expect duplicates on every UPDATE. The fact that it's intermittent and specifically tied to a preceding INSERT points more toward a concurrency or snapshot isolation issue: the UPDATE may not be seeing the correct snapshot state after the INSERT commits, depending on how the Glue catalog resolves the latest metadata between operations.
This would explain the inconsistency: it's timing and snapshot visibility dependent, not a deterministic failure.
Worth checking: what Iceberg spec version is your table using? Run SELECT * FROM your_table.snapshots and look at the summary field. And have you tried using a separate catalog namespace like spark.sql.catalog.glue instead of overriding spark_catalog? With UC enabled, overriding spark_catalog is a known source of catalog resolution conflicts that could cause exactly this kind of intermittent snapshot visibility problem.
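For example (catalog, schema, and table names are placeholders):

-- who wrote each snapshot, and with which engine build
SELECT committed_at, operation, summary
FROM spark_catalog.my_schema.my_table.snapshots
ORDER BY committed_at;

-- format-version appears as a reserved table property on recent Iceberg releases
SHOW TBLPROPERTIES spark_catalog.my_schema.my_table;

And the separate catalog namespace would just mirror your existing config under a different name:

spark.sql.catalog.glue: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.io-impl: org.apache.iceberg.aws.s3.S3FileIO

after which tables are addressed as glue.<schema>.<table_name>.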
03-15-2026 12:45 PM
It's table format version 2. I have tried using a different catalog name, but that just doesn't seem to work in a workspace with Unity Catalog enabled; only spark_catalog works. We are starting to think we've made a mistake creating our workspace with Unity Catalog enabled, as we cannot access any of the Iceberg metadata tables or procedures using four-part naming.
I have created a test workspace without Unity Catalog and tried an UPDATE on a different table. There we can use anything we choose as the catalog name. However, we still see iceberg-version=unspecified.
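To illustrate the four-part naming issue, these are the kinds of statements that fail to resolve in the UC-enabled workspace but work in the test workspace (names are placeholders):

SELECT * FROM <catalog>.<schema>.<table_name>.snapshots;
CALL <catalog>.system.rewrite_data_files(table => '<schema>.<table_name>');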
03-25-2026 08:16 AM
Hi @stemill,
The way of connecting to Iceberg tables managed by the Glue catalog that you described is not officially supported, because spark_catalog is not a generic catalog slot: it's a special, tightly wired session catalog with a lot of assumptions baked into the runtime. Overriding it with an Iceberg SparkCatalog breaks those assumptions.
It's important to define which catalog is the central data catalog: Glue or Unity Catalog?
If it's Glue, then you should check the official way of writing to Glue-managed Iceberg tables from an external writer (Databricks). It usually goes through an Iceberg REST API.
If it's Unity Catalog, then you should convert the Glue tables to Unity Catalog managed tables (the data can stay in the same S3 bucket). Then they can be accessed from external readers/writers via the Unity Catalog Iceberg REST API.
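As a rough sketch of the second option (the endpoint URI, authentication, and catalog name below are assumptions to verify against the Unity Catalog Iceberg REST documentation, not tested config), an external Spark engine would point a standard Iceberg REST catalog at Unity Catalog roughly like this:

spark.sql.catalog.uc: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.uc.catalog-impl: org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.uc.uri: https://<workspace-host>/api/2.1/unity-catalog/iceberg
spark.sql.catalog.uc.token: <access-token>
spark.sql.catalog.uc.warehouse: <uc-catalog-name>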
Hope it helps.
Best regards,
a month ago
Hi Aleksandra,
Thanks for the reply. I have tried using a different catalog slot (for example glue_catalog), but this does not work in a workspace with Unity Catalog enabled - it does work in a non-Unity Catalog workspace.
The only catalog slot that works is overriding spark_catalog. This seems to work in terms of allowing reads and writes to the tables; however, we are running into these slightly odd issues.
The central catalog is Glue; we are not really using Unity Catalog at all. It was enabled when the workspace was created, and now it doesn't seem possible to turn it off.
We are tied to the Glue catalog because of our Athena usage, and we absolutely need write access from Databricks. Would we be better off just creating a new workspace which doesn't have Unity Catalog enabled?
Kind regards
a month ago
@stemill,
The approach of overriding Spark configs to "side-load" an Iceberg catalog is not officially supported (i.e., there is no guarantee that it will work and not corrupt your data).
I don't think it's possible to create a workspace without Unity Catalog now (and it's not recommended anyway).
You can do it the other way around: make Unity Catalog the primary catalog for these tables and let external engines access them through the Unity Catalog Iceberg REST API, as described above.
Hope it helps.
Best regards,
a month ago
We need to be able to read and write the tables from both Databricks and Athena. This was the reason for choosing Iceberg tables - one datastore, multiple compute engines. It now feels like all the catalog owners are putting up restrictions, such that if you use their catalog you can only write via their engine.
Are Databricks completely getting rid of non-Unity Catalog workspaces?
a month ago
@stemill,
I understand the challenge. Unity Catalog is integrated into the Databricks platform, and it's not possible to bypass it. Also, using Glue and Unity Catalog means that you will have two catalogs governing the same data - this is not a recommended pattern.
Perhaps another option you could try is to set up external Iceberg tables in Glue and map their S3 location to external tables in Unity Catalog?
Best regards,