Can Spark JDBC create duplicate records

User16869510359
Esteemed Contributor

Is it transaction safe? Does it ensure atomicity?

ACCEPTED SOLUTION

User16869510359
Esteemed Contributor

Atomicity is ensured at the task level, not at the stage level. If the stage is retried for any reason, tasks that had already completed their write will run again and produce duplicate records. This is expected behavior by design.

When Apache Spark performs a JDBC write, each partition of the DataFrame is written to the SQL table by a separate task. Each task generally writes its partition as a single JDBC transaction, to avoid inserting the same rows repeatedly. However, if a failure occurs after a task's transaction has committed but before the final stage completes, the retried tasks can copy duplicate data into the SQL table.
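For context, here is a minimal PySpark sketch of such a write. It assumes df is the DataFrame being written; the JDBC URL, table name, and credentials are placeholders, not values from this thread. Each of the eight partitions becomes one write task, and each task commits its rows in its own JDBC transaction.

# Minimal sketch of the JDBC write described above.
# Assumes `df` already exists; URL, table, and credentials are placeholders.
(df.repartition(8)                      # 8 partitions -> 8 write tasks
   .write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://dbhost:1433;database=mydb")  # placeholder URL
   .option("dbtable", "dbo.target_table")                        # placeholder table
   .option("user", "db_user")
   .option("password", "db_password")
   .option("batchsize", 10000)          # rows per JDBC batch inside each task
   .mode("append")
   .save())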

Verify that speculative execution is disabled in your Spark configuration (spark.speculation false; it is disabled by default). Speculative execution launches duplicate attempts for slow tasks and therefore increases the likelihood of retried, and hence duplicated, writes.
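One way to check this, sketched below: set the property when the SparkSession is created and read back the effective value. The application name is a placeholder.

from pyspark.sql import SparkSession

# Speculation must be set when the application starts; it is false by default.
# A speculative duplicate task attempt can mean a duplicate JDBC commit.
spark = (SparkSession.builder
         .appName("jdbc-write-example")          # placeholder app name
         .config("spark.speculation", "false")
         .getOrCreate())

# Verify the effective value on the running application.
print(spark.sparkContext.getConf().get("spark.speculation", "false"))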

A potential workaround is to buffer the data in a temporary staging table and then MERGE it into the actual target table, as sketched below.
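A rough sketch of that workaround, under the assumption that the target has a business key to merge on; table names, column names, and the URL are illustrative placeholders.

# Write everything to a disposable staging table first.
jdbc_url = "jdbc:sqlserver://dbhost:1433;database=mydb"   # placeholder URL

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.target_staging")     # placeholder staging table
   .option("user", "db_user")
   .option("password", "db_password")
   .mode("overwrite")                           # staging table is disposable
   .save())

# Then merge into the real table on the database side, keyed on the business
# key and deduplicated, so retried task writes cannot land duplicates in the
# target (T-SQL shown as an example):
#
#   MERGE INTO dbo.target AS t
#   USING (SELECT DISTINCT id, value FROM dbo.target_staging) AS s
#     ON t.id = s.id
#   WHEN MATCHED THEN UPDATE SET value = s.value
#   WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);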


