<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Concurrency behavior with merge operations in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116473#M45322</link>
    <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Your idea of using a log table to track processed ingestions and leveraging a &lt;CODE&gt;MERGE&lt;/CODE&gt; operation in your pipeline is a sound approach for preventing duplicate data ingestion into Delta Lake. Delta Lake's ACID transactions and its support for concurrency make it well-suited for this use case. Addressing your specific concern regarding concurrent processes and the behavior of &lt;CODE&gt;MERGE&lt;/CODE&gt;, here are some key points to consider:&lt;/P&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9h9 heading3"&gt;How &lt;CODE&gt;MERGE&lt;/CODE&gt; Handles Concurrency in Delta Lake&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;ACID Transactions:&lt;/STRONG&gt; Delta Lake guarantees ACID properties: concurrent &lt;CODE&gt;MERGE&lt;/CODE&gt; statements are committed serially through the transaction log. Each transaction records its changes in the Delta log, and conflicts are detected by Delta Lake's own optimistic concurrency control, not by Spark.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Concurrency Control:&lt;/STRONG&gt; If two &lt;CODE&gt;MERGE&lt;/CODE&gt; operations targeting overlapping keys or data are executed concurrently:
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;The first transaction to commit its changes will succeed.&lt;/LI&gt;
&lt;LI&gt;The second transaction will fail with a concurrent-modification error (for example, &lt;CODE&gt;ConcurrentAppendException&lt;/CODE&gt;) if it tries to modify the same data, and it must be retried. This behavior follows from optimistic concurrency control.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Isolation Levels:&lt;/STRONG&gt; Delta Lake offers snapshot isolation, ensuring that each transaction reads a consistent snapshot of the table. Concurrent &lt;CODE&gt;MERGE&lt;/CODE&gt; operations are therefore safe in the sense that conflicting writes are never silently interleaved: the losing transaction fails cleanly and can be retried.&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Thu, 24 Apr 2025 12:50:05 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2025-04-24T12:50:05Z</dc:date>
    <item>
      <title>Concurrency behavior with merge operations</title>
      <link>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116454#M45319</link>
      <description>&lt;P&gt;Hi community,&lt;/P&gt;&lt;P&gt;I have a case right now in a project where I have to develop a solution that prevents duplicate data from being ingested twice into Delta Lake. On rare occasions, some of our data suppliers send us the same dataset in two different files within just a couple of seconds. My first idea was to build an ingestion log table that holds a set of attributes identifying a delivery, and to compare it against the payload before actual processing. One of the first operations in processing a single file would be a single MERGE statement that "locks" an ingestion as being processed and prevents other processes carrying the same data from proceeding. I have read about concurrency and isolation levels in Databricks, but I am not 100% sure how such MERGE statements fired from two different processes at the same time will behave. Can someone suggest whether MERGE is the operation to go with?&lt;/P&gt;</description>
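The ingestion-log "lock" described in the question can be sketched as a MERGE that inserts a delivery key only when it is not already present, after which only the process whose insert actually landed proceeds with the file. The sketch below simulates that first-writer-wins semantics in plain Python; the table name, delivery attributes, and the `claim_delivery` helper are hypothetical illustrations, not code from the thread:

```python
# Conceptual sketch of the ingestion-log "lock": a MERGE that inserts a
# delivery key only WHEN NOT MATCHED, so only the first process claims it.
# In Databricks SQL this would look roughly like (all names hypothetical):
#
#   MERGE INTO ingestion_log AS log
#   USING (SELECT :supplier, :dataset_hash) AS src (supplier, dataset_hash)
#   ON log.supplier = src.supplier AND log.dataset_hash = src.dataset_hash
#   WHEN NOT MATCHED THEN INSERT (supplier, dataset_hash) VALUES (src.supplier, src.dataset_hash)
#
# The process whose MERGE reports num_inserted_rows = 1 processes the file;
# everyone else skips it.

ingestion_log: set[tuple[str, str]] = set()  # stands in for the Delta log table

def claim_delivery(supplier: str, dataset_hash: str) -> bool:
    """Return True if this caller 'won' the merge and should process the file."""
    key = (supplier, dataset_hash)
    if key in ingestion_log:      # WHEN MATCHED: someone already claimed it
        return False
    ingestion_log.add(key)        # WHEN NOT MATCHED THEN INSERT
    return True

# Two processes receive the same dataset seconds apart:
first = claim_delivery("supplier_a", "abc123")
second = claim_delivery("supplier_a", "abc123")
print(first, second)  # only the first claim succeeds
```

Unlike this in-memory toy, two real MERGE statements can race; the answer below explains how Delta Lake resolves that race.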
      <pubDate>Thu, 24 Apr 2025 10:59:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116454#M45319</guid>
      <dc:creator>Bart_DE</dc:creator>
      <dc:date>2025-04-24T10:59:40Z</dc:date>
    </item>
    <item>
      <title>Re: Concurrency behavior with merge operations</title>
      <link>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116473#M45322</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Your idea of using a log table to track processed ingestions and leveraging a &lt;CODE&gt;MERGE&lt;/CODE&gt; operation in your pipeline is a sound approach for preventing duplicate data ingestion into Delta Lake. Delta Lake's ACID transactions and its support for concurrency make it well-suited for this use case. Addressing your specific concern regarding concurrent processes and the behavior of &lt;CODE&gt;MERGE&lt;/CODE&gt;, here are some key points to consider:&lt;/P&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9h9 heading3"&gt;How &lt;CODE&gt;MERGE&lt;/CODE&gt; Handles Concurrency in Delta Lake&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;ACID Transactions:&lt;/STRONG&gt; Delta Lake guarantees ACID properties: concurrent &lt;CODE&gt;MERGE&lt;/CODE&gt; statements are committed serially through the transaction log. Each transaction records its changes in the Delta log, and conflicts are detected by Delta Lake's own optimistic concurrency control, not by Spark.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Concurrency Control:&lt;/STRONG&gt; If two &lt;CODE&gt;MERGE&lt;/CODE&gt; operations targeting overlapping keys or data are executed concurrently:
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI&gt;The first transaction to commit its changes will succeed.&lt;/LI&gt;
&lt;LI&gt;The second transaction will fail with a concurrent-modification error (for example, &lt;CODE&gt;ConcurrentAppendException&lt;/CODE&gt;) if it tries to modify the same data, and it must be retried. This behavior follows from optimistic concurrency control.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Isolation Levels:&lt;/STRONG&gt; Delta Lake offers snapshot isolation, ensuring that each transaction reads a consistent snapshot of the table. Concurrent &lt;CODE&gt;MERGE&lt;/CODE&gt; operations are therefore safe in the sense that conflicting writes are never silently interleaved: the losing transaction fails cleanly and can be retried.&lt;/LI&gt;
&lt;/OL&gt;</description>
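The optimistic-concurrency behavior described in the three points above can be illustrated with a small simulation: each writer records the table version it read, and a commit is rejected if the version moved underneath it, forcing a retry. The `ToyDeltaTable` class and its single version counter are a deliberately coarse toy model, not Delta Lake's actual implementation (which conflicts only on overlapping data, not on any concurrent commit):

```python
# Toy model of Delta Lake's optimistic concurrency: each transaction snapshots
# a table version on read; commit succeeds only if nothing else committed in
# between, otherwise the caller gets a conflict error and must retry.
# (Real Delta Lake is finer-grained: it only fails on overlapping changes.)

class ConflictError(Exception):
    pass

class ToyDeltaTable:
    def __init__(self) -> None:
        self.version = 0
        self.rows: dict[str, str] = {}

    def snapshot(self) -> int:
        # Snapshot isolation: a transaction reads one consistent version.
        return self.version

    def commit_merge(self, read_version: int, key: str, value: str) -> None:
        # Conflict detection: fail if the table changed since we read it.
        if self.version != read_version:
            raise ConflictError(f"table moved to v{self.version}, retry needed")
        self.rows[key] = value
        self.version += 1

table = ToyDeltaTable()
v = table.snapshot()

# Writer 1 commits first and wins.
table.commit_merge(v, "delivery-42", "processed")

# Writer 2, which read the same snapshot, now conflicts and must retry.
try:
    table.commit_merge(v, "delivery-42", "processed")
    outcome = "committed"
except ConflictError:
    # On retry against a fresh snapshot, the MERGE would see the row already
    # exists, its WHEN NOT MATCHED clause would not fire, and the duplicate
    # file would be skipped -- which is exactly the dedup behavior wanted here.
    outcome = "retried"

print(outcome)  # the second writer hits a conflict
```

In a real pipeline the retry would simply re-run the MERGE, which then finds the ingestion already logged and does nothing.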
      <pubDate>Thu, 24 Apr 2025 12:50:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116473#M45322</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-04-24T12:50:05Z</dc:date>
    </item>
    <item>
      <title>Re: Concurrency behavior with merge operations</title>
      <link>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116589#M45347</link>
      <description>&lt;P&gt;Thank you &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt; for the reply. I think it's all clear now. I have also found &lt;A href="https://www.databricks.com/blog/deep-dive-how-row-level-concurrency-works-out-box" target="_self"&gt;this great article&lt;/A&gt; that explains how row-level concurrency works when deletion vectors and liquid clustering are enabled.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Apr 2025 14:24:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/concurency-behavior-with-merge-operations/m-p/116589#M45347</guid>
      <dc:creator>Bart_DE</dc:creator>
      <dc:date>2025-04-25T14:24:50Z</dc:date>
    </item>
  </channel>
</rss>

