
Concurrency behavior with merge operations

Bart_DE
New Contributor II

Hi community,

I have a case in my current project where I have to develop a solution that prevents the same data from being ingested twice into Delta Lake. On rare occasions, some of our data suppliers send us the same dataset in two different files within a couple of seconds. My first idea was to build an ingestion log table that holds a set of attributes identifying a delivery, and to compare the incoming payload against it before actual processing. One of the first operations when processing a single file would be a single MERGE statement that "locks" an ingestion as being processed and prevents other processes carrying the same data from proceeding. I have read about concurrency and isolation levels in Databricks, but I am not 100% sure how such MERGE statements fired from two different processes at the same time will behave. Can someone confirm whether MERGE is the right operation to use here?
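
Roughly what I have in mind for the log (a sketch only; table and column names below are placeholders, not an existing table):

# Hypothetical ingestion log table, one row per registered delivery.
spark.sql("""
  CREATE TABLE IF NOT EXISTS ops.ingestion_log (
    supplier      STRING,
    delivery_id   STRING,    -- attributes that uniquely identify a delivery
    file_hash     STRING,
    status        STRING,    -- e.g. IN_PROGRESS / DONE
    locked_at     TIMESTAMP
  ) USING DELTA
""")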

1 ACCEPTED SOLUTION


Walter_C
Databricks Employee

Your idea of using a log table to track processed ingestions and leveraging a MERGE operation in your pipeline is a sound approach for preventing duplicate data ingestion into Delta Lake. Delta Lake's ACID transactions and its support for concurrency make it well-suited for this use case. Addressing your specific concern regarding concurrent processes and the behavior of MERGE, here are some key points to consider:

How MERGE Handles Concurrency in Delta Lake

  1. ACID Transactions: Delta Lake guarantees ACID properties, which means concurrent MERGE statements are serialized at commit time. Each transaction logs its changes to the Delta log, and conflicts are detected and resolved by Delta Lake's optimistic concurrency control.
  2. Concurrency Control: If two MERGE operations targeting overlapping keys or data are executed concurrently:
    • The first transaction to commit its changes will succeed.
    • The second transaction will fail if it tries to modify the same data (typically with a ConcurrentAppendException) and needs to be retried; a sketch of how to handle this follows below. This behavior is based on optimistic concurrency.
  3. Isolation Levels: Delta Lake provides snapshot isolation, so each transaction reads a consistent snapshot of the table. Combined with conflict detection at commit time, this makes your MERGE safe to run concurrently: the losing transaction fails and can be retried or treated as a duplicate, rather than producing duplicate data.
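
A minimal sketch of what this could look like in a notebook, assuming the hypothetical ops.ingestion_log table sketched in the question and the delta-spark package's delta.exceptions module (table, column, and function names are illustrative, not part of your pipeline):

from delta.exceptions import ConcurrentAppendException

def try_lock_delivery(supplier: str, delivery_id: str, file_hash: str) -> bool:
    """Return True if this process won the 'lock', False if the delivery is already registered."""
    try:
        result = spark.sql(f"""
            MERGE INTO ops.ingestion_log AS log
            USING (SELECT '{supplier}'    AS supplier,
                          '{delivery_id}' AS delivery_id,
                          '{file_hash}'   AS file_hash) AS new
            ON  log.supplier    = new.supplier
            AND log.delivery_id = new.delivery_id
            AND log.file_hash   = new.file_hash
            WHEN NOT MATCHED THEN
              INSERT (supplier, delivery_id, file_hash, status, locked_at)
              VALUES (new.supplier, new.delivery_id, new.file_hash, 'IN_PROGRESS', current_timestamp())
        """)
        # On Databricks, MERGE returns operation metrics; if nothing was inserted,
        # another run already registered this delivery and the file should be skipped.
        return result.collect()[0]["num_inserted_rows"] > 0
    except ConcurrentAppendException:
        # A concurrent MERGE for the same delivery committed first; treat this file as a duplicate.
        return False

Because this MERGE is insert-only, the losing process does not need a retry loop: a conflict (or a matched row) simply means another process owns the delivery, and the file can be skipped or quarantined.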


Bart_DE
New Contributor II

Thank you @Walter_C for the reply. I think it's all clear now. I have also found a great article that explains how row-level concurrency works when deletion vectors and liquid clustering are enabled.
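
For reference, a minimal sketch of how deletion vectors and liquid clustering could be enabled on the hypothetical ingestion log table from above (feature availability depends on the Databricks Runtime version in use):

# Row-level concurrency relies on deletion vectors being enabled on the table.
spark.sql("""
  ALTER TABLE ops.ingestion_log
  SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")

# Liquid clustering on the lookup keys used by the MERGE condition.
spark.sql("""
  ALTER TABLE ops.ingestion_log
  CLUSTER BY (supplier, delivery_id)
""")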
