Concurrency behavior with merge operations

Bart_DE · 04-24-2025

Hi community,

In my current project I have to develop a solution that prevents the same data from being ingested twice into Delta Lake. On rare occasions, some of our data suppliers send us the same dataset in two different files within just a couple of seconds. My first idea was to build an ingestion log table that holds the set of attributes identifying a delivery, and to check each incoming file against it before the payload is actually processed. One of the first steps in processing a single file would be a single merge statement that "locks" the delivery by marking it as being processed, so that any other process carrying the same data cannot proceed. I have read about concurrency and isolation levels in Databricks, but I am not 100% sure how two such merge statements fired from two different processes at the same time will behave. Can someone tell me whether merge is the right operation to use here?
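To make the idea concrete, here is a minimal sketch of the "claim" step I have in mind, using the Delta Lake Python API. The table name `ingestion_log`, the columns `delivery_key` and `run_id`, and the helper names are my own placeholders, not anything that exists yet. The read-back at the end assumes that at most one row per key can ever be committed, which is exactly the part I am unsure about when two merges run at the same moment.

```python
import hashlib
import uuid


def delivery_key(supplier: str, dataset_id: str, payload_checksum: str) -> str:
    """Build a deterministic key from the attributes that identify a delivery.

    The three attributes are hypothetical; any stable set of identifying
    fields would work the same way.
    """
    raw = "|".join([supplier, dataset_id, payload_checksum])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def try_claim_delivery(spark, log_table: str, key: str) -> bool:
    """Try to register a delivery in the log table with an insert-only MERGE.

    Returns True if this run's row is the one stored for the key.
    ASSUMPTION: concurrent merges never commit two rows for the same key.
    """
    # Imported here so the key helper above works without Spark installed.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F

    run_id = str(uuid.uuid4())  # identifies this processing attempt
    source = spark.createDataFrame(
        [(key, run_id)], "delivery_key string, run_id string"
    )

    # Insert the claim only if no row with this key exists yet.
    (
        DeltaTable.forName(spark, log_table)
        .alias("t")
        .merge(source.alias("s"), "t.delivery_key = s.delivery_key")
        .whenNotMatchedInsertAll()
        .execute()
    )

    # Read back who owns the key; only the owner should go on to the payload.
    owner = (
        spark.table(log_table)
        .where(F.col("delivery_key") == key)
        .select("run_id")
        .first()
    )
    return owner is not None and owner["run_id"] == run_id
```

Each file-processing job would call `try_claim_delivery(spark, "ingestion_log", delivery_key(...))` first and skip the file when it returns `False`. If the merge in the second process raises a conflict exception instead of completing, I would presumably need to catch it and treat it as "already claimed" or retry, but I do not know which of these outcomes to expect.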