
Concurrency behavior with merge operations

Bart_DE
New Contributor II

Hi community,

I have a case in my current project where I have to develop a solution that prevents the same data from being ingested twice into Delta Lake. On rare occasions, some of our data suppliers send us the same dataset in two different files within a couple of seconds. My first idea was to build an ingestion log table that holds a set of attributes identifying a delivery, and to compare the incoming payload against it before actual processing. One of the first operations when processing a single file would be a single MERGE statement that "locks" an ingestion as being processed and prevents other processes carrying the same data from proceeding. I have read about concurrency and isolation levels in Databricks, but I am not 100% sure how such MERGE statements fired from two different processes at the same time will behave. Can someone confirm whether MERGE is the right operation to use here?
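
Roughly what I have in mind for the log (a sketch only; table and column names below are placeholders, not an existing table):

# Hypothetical ingestion log table, one row per registered delivery.
spark.sql("""
  CREATE TABLE IF NOT EXISTS ops.ingestion_log (
    supplier      STRING,
    delivery_id   STRING,    -- attributes that uniquely identify a delivery
    file_hash     STRING,
    status        STRING,    -- e.g. IN_PROGRESS / DONE
    locked_at     TIMESTAMP
  ) USING DELTA
""")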

1 ACCEPTED SOLUTION


Walter_C
Databricks Employee

Your idea of using a log table to track processed ingestions and leveraging a MERGE operation in your pipeline is a sound approach for preventing duplicate data ingestion into Delta Lake. Delta Lake's ACID transactions and its support for concurrency make it well-suited for this use case. Addressing your specific concern regarding concurrent processes and the behavior of MERGE, here are some key points to consider:

How MERGE Handles Concurrency in Delta Lake

  1. ACID Transactions: Delta Lake guarantees ACID properties, which means concurrent MERGE statements are serialized at commit time. Each transaction logs its changes to the Delta log, and conflicts are detected and resolved by Delta Lake's optimistic concurrency control.
  2. Concurrency Control: If two MERGE operations targeting overlapping keys or data are executed concurrently:
    • The first transaction to commit its changes will succeed.
    • The second transaction will fail if it tries to modify the same data (typically with a ConcurrentAppendException) and needs to be retried; a sketch of how to handle this follows below. This behavior is based on optimistic concurrency.
  3. Isolation Levels: Delta Lake provides snapshot isolation, so each transaction reads a consistent snapshot of the table. Combined with conflict detection at commit time, this makes your MERGE safe to run concurrently: the losing transaction fails and can be retried or treated as a duplicate, rather than producing duplicate data.
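
A minimal sketch of what this could look like in a notebook, assuming the hypothetical ops.ingestion_log table sketched in the question and the delta-spark package's delta.exceptions module (table, column, and function names are illustrative, not part of your pipeline):

from delta.exceptions import ConcurrentAppendException

def try_lock_delivery(supplier: str, delivery_id: str, file_hash: str) -> bool:
    """Return True if this process won the 'lock', False if the delivery is already registered."""
    try:
        result = spark.sql(f"""
            MERGE INTO ops.ingestion_log AS log
            USING (SELECT '{supplier}'    AS supplier,
                          '{delivery_id}' AS delivery_id,
                          '{file_hash}'   AS file_hash) AS new
            ON  log.supplier    = new.supplier
            AND log.delivery_id = new.delivery_id
            AND log.file_hash   = new.file_hash
            WHEN NOT MATCHED THEN
              INSERT (supplier, delivery_id, file_hash, status, locked_at)
              VALUES (new.supplier, new.delivery_id, new.file_hash, 'IN_PROGRESS', current_timestamp())
        """)
        # On Databricks, MERGE returns operation metrics; if nothing was inserted,
        # another run already registered this delivery and the file should be skipped.
        return result.collect()[0]["num_inserted_rows"] > 0
    except ConcurrentAppendException:
        # A concurrent MERGE for the same delivery committed first; treat this file as a duplicate.
        return False

Because this MERGE is insert-only, the losing process does not need a retry loop: a conflict (or a matched row) simply means another process owns the delivery, and the file can be skipped or quarantined.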


Bart_DE
New Contributor II

Thank you @Walter_C for the reply. I think it's all clear now. I have also found a great article that explains how row-level concurrency works when deletion vectors and liquid clustering are enabled.
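
For reference, a minimal sketch of how deletion vectors and liquid clustering could be enabled on the hypothetical ingestion log table from above (feature availability depends on the Databricks Runtime version in use):

# Row-level concurrency relies on deletion vectors being enabled on the table.
spark.sql("""
  ALTER TABLE ops.ingestion_log
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# Liquid clustering on the lookup keys used by the MERGE condition.
spark.sql("""
  ALTER TABLE ops.ingestion_log
  CLUSTER BY (supplier, delivery_id)
""")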
