Can Spark JDBC create duplicate records

User16869510359
Esteemed Contributor

Is it transaction safe? Does it ensure atomicity?

ACCEPTED SOLUTION

User16869510359
Esteemed Contributor

Atomicity is ensured at the task level, not at the stage level. If the stage is retried for any reason, tasks that had already completed their write will run again and produce duplicate records. This is expected behavior by design.

When Apache Spark performs a JDBC write, each partition of the DataFrame is written to the SQL table by a separate task. Each task generally writes its partition as a single JDBC transaction, to avoid inserting the same rows repeatedly. However, if a failure occurs after a task's transaction has committed but before the final stage completes, the retried tasks can copy duplicate data into the SQL table.
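For context, here is a minimal PySpark sketch of such a write. It assumes df is the DataFrame being written; the JDBC URL, table name, and credentials are placeholders, not values from this thread. Each of the eight partitions becomes one write task, and each task commits its rows in its own JDBC transaction.

# Minimal sketch of the JDBC write described above.
# Assumes `df` already exists; URL, table, and credentials are placeholders.
(df.repartition(8)                      # 8 partitions -> 8 write tasks
   .write
   .format("jdbc")
   .option("url", "jdbc:sqlserver://dbhost:1433;database=mydb")  # placeholder URL
   .option("dbtable", "dbo.target_table")                        # placeholder table
   .option("user", "db_user")
   .option("password", "db_password")
   .option("batchsize", 10000)          # rows per JDBC batch inside each task
   .mode("append")
   .save())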

Verify that speculative execution is disabled in your Spark configuration (spark.speculation false; it is disabled by default). Speculative execution launches duplicate attempts for slow tasks and therefore increases the likelihood of retried, and hence duplicated, writes.
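One way to check this, sketched below: set the property when the SparkSession is created and read back the effective value. The application name is a placeholder.

from pyspark.sql import SparkSession

# Speculation must be set when the application starts; it is false by default.
# A speculative duplicate task attempt can mean a duplicate JDBC commit.
spark = (SparkSession.builder
         .appName("jdbc-write-example")          # placeholder app name
         .config("spark.speculation", "false")
         .getOrCreate())

# Verify the effective value on the running application.
print(spark.sparkContext.getConf().get("spark.speculation", "false"))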

A potential workaround is to buffer the data in a temporary staging table and then MERGE it into the actual target table, as sketched below.
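A rough sketch of that workaround, under the assumption that the target has a business key to merge on; table names, column names, and the URL are illustrative placeholders.

# Write everything to a disposable staging table first.
jdbc_url = "jdbc:sqlserver://dbhost:1433;database=mydb"   # placeholder URL

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "dbo.target_staging")     # placeholder staging table
   .option("user", "db_user")
   .option("password", "db_password")
   .mode("overwrite")                           # staging table is disposable
   .save())

# Then merge into the real table on the database side, keyed on the business
# key and deduplicated, so retried task writes cannot land duplicates in the
# target (T-SQL shown as an example):
#
#   MERGE INTO dbo.target AS t
#   USING (SELECT DISTINCT id, value FROM dbo.target_staging) AS s
#     ON t.id = s.id
#   WHEN MATCHED THEN UPDATE SET value = s.value
#   WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);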


