Can Spark JDBC create duplicate records?

User16869510359
Esteemed Contributor

Is it transaction safe? Does it ensure atomicity?

1 ACCEPTED SOLUTION

User16869510359
Esteemed Contributor

Atomicity is ensured at the task level, not at the stage level. If a stage is retried for any reason, tasks that have already completed the write will run again and produce duplicate records. This is expected behavior by design.

When Apache Spark performs a JDBC write, each partition of the DataFrame is written to the SQL table by its own task, generally as a single JDBC transaction so data is not inserted repeatedly. However, if the job fails after some of those transactions have committed but before the final stage completes, the retried tasks can copy duplicate data into the SQL table.
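
To make the failure mode concrete, here is a minimal sketch of a partition-per-transaction JDBC write in PySpark. The JDBC URL, table name, and credentials are placeholders, not values from the original post.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-write-example").getOrCreate()

df = spark.range(0, 1_000_000)  # example DataFrame with a single "id" column

# Each partition of df is written by its own task, typically in its own
# JDBC transaction. If the stage is retried after some tasks have already
# committed, those partitions are written a second time, creating duplicates.
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<host>:1433;database=<db>")  # placeholder
   .option("dbtable", "dbo.target_table")                        # placeholder
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())
```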

Verify that speculative execution is disabled in your Spark configuration: spark.speculation false (this is the default). Speculative execution launches duplicate attempts of slow tasks, which increases the chance of retried writes.
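
A quick way to check is to set and read the property when the SparkSession is built; this sketch assumes you control the session configuration:

```python
from pyspark.sql import SparkSession

# Explicitly disable speculative execution (this is also the default),
# so no duplicate speculative task attempts run the JDBC write.
spark = (SparkSession.builder
         .appName("jdbc-write-no-speculation")
         .config("spark.speculation", "false")
         .getOrCreate())

# Sanity check: prints "false" when the setting is in effect.
print(spark.sparkContext.getConf().get("spark.speculation", "false"))
```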

A potential workaround is to write the data to a temporary staging table first and then MERGE it into the target table.
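
Below is a sketch of that staging-table workaround. The connection details, table and column names, and the use of pyodbc to issue the final MERGE are illustrative assumptions, not part of the original answer; any client that can run SQL against the target database would work, and the MERGE syntax shown is SQL Server's.

```python
import pyodbc
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-merge-workaround").getOrCreate()
df = spark.range(0, 1_000_000)  # example DataFrame with a single "id" column

# 1. Overwrite a staging table. Overwriting keeps the Spark write idempotent:
#    a stage retry simply rewrites the same staging data.
(df.write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://<host>:1433;database=<db>")  # placeholder
   .option("dbtable", "dbo.staging_table")                       # placeholder
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("overwrite")
   .save())

# 2. MERGE the staging table into the target table in a single transaction,
#    so each key is inserted at most once.
merge_sql = """
MERGE dbo.target_table AS t
USING dbo.staging_table AS s
    ON t.id = s.id
WHEN NOT MATCHED THEN
    INSERT (id) VALUES (s.id);
"""
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<host>;DATABASE=<db>;UID=<user>;PWD=<password>"
)
conn.execute(merge_sql)
conn.commit()
conn.close()
```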
