Data Engineering

Can Spark JDBC create duplicate records?

brickster_2018
Databricks Employee

Is it transaction-safe? Does it ensure atomicity?

1 ACCEPTED SOLUTION


brickster_2018
Databricks Employee

Atomicity is ensured at the task level, not at the stage level. If the stage is retried for any reason, tasks that have already completed their write operation run again and produce duplicate records. This is expected behavior by design.

When Apache Spark performs a JDBC write, each partition of the DataFrame is written to the SQL table by a separate task, generally as a single JDBC transaction per task to avoid repeatedly inserting the same data. However, if the job fails after a task has committed but before the final stage completes, the retried task can insert the same rows again, leaving duplicate data in the SQL table.
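For illustration, a minimal sketch of an append-mode JDBC write in PySpark (the URL, table name, and credentials are placeholders, not from this thread; `df` is an existing DataFrame):

# Each DataFrame partition is written by its own task. If a stage retry re-runs a
# task after its inserts were already committed, those rows are appended a second time.
(df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://example-host:1433;databaseName=mydb")  # assumed URL
    .option("dbtable", "dbo.target_table")                                  # assumed table
    .option("user", "my_user")                                              # placeholder credentials
    .option("password", "my_password")
    .mode("append")
    .save())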

Verify that speculative execution is disabled in your Spark configuration: spark.speculation false (it is disabled by default). Speculative execution launches duplicate copies of slow tasks, which increases the possibility of retried writes.
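A quick sanity check, assuming a running SparkSession named `spark` (note that spark.speculation is a scheduler setting, so it should be set in the cluster's Spark config at startup rather than changed mid-session):

# Returns "false" when speculative execution is disabled or unset (the default).
print(spark.sparkContext.getConf().get("spark.speculation", "false"))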

Writing the data to a temporary staging table first and then MERGE-ing it into the actual table could be a potential workaround, since the MERGE can de-duplicate on a key; see the sketch below.
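A rough sketch of that workaround (table names, key columns, connection variables, and the SQL Server-style MERGE dialect are all assumptions; adapt them to your database):

# 1) Overwrite a staging table via JDBC -- overwrite keeps this step idempotent on retries.
(df.write
    .format("jdbc")
    .option("url", jdbc_url)            # assumed variable holding the JDBC URL
    .option("dbtable", "dbo.staging_target")
    .option("user", user)               # assumed credential variables
    .option("password", password)
    .mode("overwrite")
    .save())

# 2) MERGE the staging table into the target on the database side, de-duplicating on a key.
#    Run this statement with your database's own client or a scheduled job:
merge_sql = """
MERGE INTO dbo.target_table AS t
USING dbo.staging_target AS s
  ON t.id = s.id                        -- assumed unique key
WHEN MATCHED THEN
  UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN
  INSERT (id, value) VALUES (s.id, s.value);
"""

Because the MERGE matches on a key, re-running it against the same staging data does not create duplicates in the target table.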


