Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

update on iceberg table creating duplicate records

stemill
New Contributor II

We are using Databricks to connect to a Glue catalog which contains Iceberg tables. We are using DBR 17.2 and adding the jars

org.apache.iceberg:iceberg-spark-runtime-4.0_2.13:1.10.0
org.apache.iceberg:iceberg-aws-bundle:1.10.0

The Spark config is then set to

spark.sql.catalog.spark_catalog: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.spark_catalog.catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.spark_catalog.io-impl: org.apache.iceberg.aws.s3.S3FileIO
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.databricks.hive.metastore.glueCatalog.enabled: true

This allows read/write access to the Iceberg tables using spark_catalog.<schema>.<table_name>

This is largely working fine; however, we have recently run into a problem where an UPDATE statement causes duplicate rows to be created. After repeated executions this quickly becomes a huge problem, with the table growing exponentially in size. To be clear, this is a plain UPDATE <table> SET <fields> WHERE <filter>; it is NOT a MERGE statement.
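For reference, a minimal sketch of the pattern described above (the schema, table, and column names are placeholders) that could be used to check whether the row count grows after a plain UPDATE:

```sql
-- Placeholder names; run against the Glue-backed spark_catalog
CREATE TABLE spark_catalog.myschema.update_test (id BIGINT, val STRING) USING iceberg;
INSERT INTO spark_catalog.myschema.update_test VALUES (1, 'a'), (2, 'b');
-- UPDATE with no intervening INSERT: reportedly fine
UPDATE spark_catalog.myschema.update_test SET val = 'x' WHERE id = 1;
INSERT INTO spark_catalog.myschema.update_test VALUES (3, 'c');
-- UPDATE preceded by an INSERT: the reported problem case
UPDATE spark_catalog.myschema.update_test SET val = 'y' WHERE id = 2;
-- If the bug reproduces, this returns more than 3 rows
SELECT COUNT(*) FROM spark_catalog.myschema.update_test;
```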

We are in a workspace with Unity Catalog enabled (but we are not really using it). Looking at the snapshots data for the table in question, we can see all the operations have

iceberg-version=Apache Iceberg unspecified (commit 7dbafb438ee1e68d0047bebcb587265d7d87d8a1)

When I have tried this using the same jars on OSS Spark, the iceberg-version accurately reflects the version of the jars. I cannot get the duplication of rows to occur on OSS Spark.

Not every update causes this problem; it only happens when an INSERT statement has run since the last UPDATE. I'm not sure if this is something to do with Unity Catalog or where to look next.
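One way to pin down which commit introduced the extra rows is the Iceberg snapshots metadata table (the table name below is a placeholder); the summary map carries per-snapshot record counts:

```sql
-- Per-snapshot record counts from the Iceberg snapshot summary
SELECT committed_at,
       snapshot_id,
       operation,
       summary['added-records']   AS added,
       summary['deleted-records'] AS deleted,
       summary['total-records']   AS total
FROM spark_catalog.myschema.mytable.snapshots
ORDER BY committed_at;
```

A snapshot whose total-records grows by more than the number of rows the UPDATE touched would mark the offending commit.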


2 REPLIES

mderela
Contributor

Interesting observation about the INSERT-before-UPDATE pattern; that's actually a useful clue for narrowing this down.
The iceberg-version=unspecified is worth investigating separately: perhaps the runtime is for some reason ignoring your jars, but that is only a hypothesis. On the duplicates, though: if this were purely a format-level issue with position deletes not being supported, you'd expect duplicates on every UPDATE. The fact that it's intermittent and specifically tied to a preceding INSERT points more toward a concurrency or snapshot isolation issue: the UPDATE may not be seeing the correct snapshot state after the INSERT commits, depending on how the Glue catalog resolves the latest metadata between operations.
This would explain the inconsistency: it's timing- and snapshot-visibility-dependent, not a deterministic failure.
Worth checking: what Iceberg spec version is your table using? Run SELECT * FROM your_table.snapshots and look at the summary field. And have you tried using a separate catalog name like spark.sql.catalog.glue instead of overriding spark_catalog? With UC enabled, overriding spark_catalog is a known source of catalog resolution conflicts that could cause exactly this kind of intermittent snapshot visibility problem.
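A sketch of that separate-catalog setup, in the same key/value style as the original config (the catalog name glue is illustrative; any name other than spark_catalog would do):

```
spark.sql.catalog.glue: org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl: org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.io-impl: org.apache.iceberg.aws.s3.S3FileIO
spark.sql.extensions: org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```

Tables would then be addressed as glue.<schema>.<table_name>, leaving spark_catalog untouched.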

stemill
New Contributor II

It's table format version 2. I have tried using a different catalog name, but this just doesn't seem to work in a workspace with Unity Catalog enabled; only spark_catalog works. We are starting to think we made a mistake creating our workspace with Unity Catalog enabled, as we cannot access any of the Iceberg metadata or procedures using four-part naming.

I have created a test workspace without Unity Catalog and tried an update on a different table. Here we can use anything we choose as the catalog name. However, we still see iceberg-version=unspecified.