- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-28-2025 08:49 AM
Good afternoon everyone,
I’m writing this post to see if anyone has encountered this problem and if there is a way to resolve it or understand why it happens. I’m working in a Databricks Runtime 15.4 LTS environment, which includes Apache Spark 3.5.0 and Scala 2.12. The target database is SQL Server, and the table in question has a composite primary key that prevents inserting duplicates. Part of its definition is as follows: the primary key is composed of the columns Cod1, Cod2, and Cod3, and the table settings include options such as STATISTICS_NORECOMPUTE set to OFF, IGNORE_DUP_KEY set to OFF, and OPTIMIZE_FOR_SEQUENTIAL_KEY set to ON.
The process consists of two main parts. First, I have a DataFrame called today_df, which is generated from several Delta tables and undergoes various transformations. Second, I have the yesterday_df table, which is obtained from the SQL Server database to synchronize both sources. The goal is to compare the two sources to detect differences. To achieve this, I generate a hash of the values in each row, excluding the primary key columns, and use this to identify new records, deleted records, and records that need to be updated.
The problem occurs when trying to insert the new records into SQL Server. I use a standard Spark JDBC write with append mode, specifying the SQL Server driver, connection URL, user, password, target table, batch size, number of partitions, isolation level, and enabling statement rewriting and compression. Despite this, I receive a PrimaryKeyViolation error, indicating that inserts_df contains a record with a primary key that already exists in the database. However, upon reviewing the data in inserts_df, I cannot find any record with a combination of keys matching an existing row in SQL Server.
Py4JJavaError: An error occurred while calling oXXX.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage XXX.X failed 4 times, most recent failure: Lost task 8.X in stage XXX.X (TID XXX) (XXX.XXX.XX.XX executor 0): java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint 'PK_TableName_Version'. Cannot insert duplicate key in object 'Schema.TableName_Version'. The duplicate key value is (Key1Value, Key2Value, Key3Value).
at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeBatch(SQLServerPreparedStatement.java:XXXX)
I found an article in the Databricks knowledge base describing a similar issue, which explains that this can occur when Spark retries writing a partition after a successful commit, potentially creating duplicates in the target. The recommended solution is to create a temporary table in the database and then perform a MERGE with the final table.
In my case, I would like to know what alternatives exist to resolve this problem, since the target table is very large, over 40 million records, and I want to avoid using overwrite mode or a temporary table for performance and space reasons.
I would also like to better understand why this behavior occurs, since I cannot locate the record that supposedly already exists in the database. Everything suggests that the conflict happens during one of the partitions writing from inserts_df, but the record does not appear in the DataFrame itself.
I would greatly appreciate it if anyone who has experienced this same error could share how they resolved it, or if there is any configuration or best practice to avoid such duplicates during the JDBC write process.
Thank you very much in advance for any guidance or experience you can share.