<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Merge Performance Issues in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/merge-performance-issues/m-p/120071#M46050</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/165538"&gt;@Harshul&lt;/a&gt;,&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Yes, if configured properly, INSERT OVERWRITE keeps the data consistent. You can deduplicate the incremental data using ROW_NUMBER() and use INSERT OVERWRITE with replaceWhere for efficient daily bulk updates.&lt;/P&gt;&lt;P&gt;2. Usually lower, since it avoids the heavy shuffling and joins that MERGE performs.&lt;/P&gt;&lt;P&gt;3. Yes, because only the affected partitions are rewritten, making it more efficient.&lt;/P&gt;</description>
    <pubDate>Fri, 23 May 2025 13:31:27 GMT</pubDate>
    <dc:creator>Renu_</dc:creator>
    <dc:date>2025-05-23T13:31:27Z</dc:date>
    <item>
      <title>Merge Performance Issues</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-performance-issues/m-p/119972#M46011</link>
      <description>&lt;P&gt;&lt;SPAN&gt;The issue we are currently facing in our project is &lt;FONT color="#FF0000"&gt;&lt;STRONG&gt;Slow Merge Performance.&lt;/STRONG&gt;&lt;/FONT&gt;&lt;BR /&gt;The production database has 50 billion records (historical data); the data is partitioned by date but not indexed.&lt;BR /&gt;The incremental data has close to 250 to 500 million records. It covers 2 dates, mainly today and yesterday, and is mostly inserts and updates with no deletes.&lt;BR /&gt;The MERGE statement takes 1 hour to run. Since Databricks doesn't allow a WHERE clause in the merge condition, we added a condition date &amp;gt;{condition} and &amp;lt;{condition} to it, but there was no significant improvement in performance.&lt;BR /&gt;So I am thinking of trying an INSERT OVERWRITE statement instead, and wanted to know a few details from the community.&lt;BR /&gt;&lt;FONT color="#333399"&gt;1&amp;nbsp;Will the data still be logically correct to consume, i.e. will I get the same result as if I had run the MERGE statement?&lt;/FONT&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;FONT color="#333399"&gt;&lt;SPAN&gt;2 If yes, will the compute usage go up, go down, or stay nearly the same?&lt;BR /&gt;3 Will it provide any significant performance improvement?&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 22 May 2025 13:43:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-performance-issues/m-p/119972#M46011</guid>
      <dc:creator>Harshul</dc:creator>
      <dc:date>2025-05-22T13:43:19Z</dc:date>
    </item>
    <item>
      <title>Re: Merge Performance Issues</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-performance-issues/m-p/120071#M46050</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/165538"&gt;@Harshul&lt;/a&gt;,&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Yes, if configured properly, INSERT OVERWRITE keeps the data consistent. You can deduplicate the incremental data using ROW_NUMBER() and use INSERT OVERWRITE with replaceWhere for efficient daily bulk updates.&lt;/P&gt;&lt;P&gt;2. Usually lower, since it avoids the heavy shuffling and joins that MERGE performs.&lt;/P&gt;&lt;P&gt;3. Yes, because only the affected partitions are rewritten, making it more efficient.&lt;/P&gt;</description>
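      <content:encoded><![CDATA[
The equivalence in point 1 can be sketched with a small plain-Python model of the two strategies. This is not Spark code and not the poster's pipeline: the row shape (key, date, updated_at, value) and all function names are hypothetical, and it assumes that a key never moves between dates and that every incoming row falls inside the overwritten dates. Under those assumptions, overwriting only the affected partitions (after carrying over their existing rows and deduplicating the batch) should give the same table as an upsert-style merge.

```python
# Hypothetical row shape: (key, date, updated_at, value). These columns are
# assumptions for illustration; the thread does not describe the real schema.

def dedupe_latest(rows):
    """Keep one row per key, the one with the highest updated_at.

    Models ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) = 1.
    On ties, the row that appears later in `rows` wins.
    """
    latest = {}
    for row in rows:
        key, updated_at = row[0], row[2]
        if key not in latest or updated_at >= latest[key][2]:
            latest[key] = row
    return list(latest.values())


def merge_upsert(target, incoming):
    """Reference semantics of an upsert merge: update matches, insert the rest."""
    by_key = {row[0]: row for row in target}
    for row in dedupe_latest(incoming):
        by_key[row[0]] = row
    return sorted(by_key.values())


def overwrite_partitions(target, incoming, dates):
    """Rewrite only the partitions listed in `dates`; leave all others untouched."""
    stray = [row for row in incoming if row[1] not in dates]
    if stray:
        raise ValueError(f"incoming rows outside the overwritten dates: {stray}")
    untouched = [row for row in target if row[1] not in dates]
    # Carry over existing rows of the affected dates so that keys absent from
    # the batch are not lost, then let the deduplicated batch win on conflicts.
    by_key = {row[0]: row for row in target if row[1] in dates}
    for row in dedupe_latest(incoming):
        by_key[row[0]] = row
    return sorted(untouched + list(by_key.values()))


target = [
    ("a", "2025-05-20", 1, "old-a"),
    ("b", "2025-05-21", 1, "old-b"),
    ("c", "2025-05-22", 1, "old-c"),
]
incoming = [
    ("b", "2025-05-21", 2, "new-b"),
    ("b", "2025-05-21", 3, "newer-b"),  # duplicate key; the latest version wins
    ("d", "2025-05-22", 1, "new-d"),
]
dates = {"2025-05-21", "2025-05-22"}

result = overwrite_partitions(target, incoming, dates)
print(result == merge_upsert(target, incoming))
```

If the batch already contains the complete contents of the affected dates, the carry-over step is unnecessary. If a key can change its date, the two strategies can diverge and this sketch no longer applies.
]]></content:encoded>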
      <pubDate>Fri, 23 May 2025 13:31:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-performance-issues/m-p/120071#M46050</guid>
      <dc:creator>Renu_</dc:creator>
      <dc:date>2025-05-23T13:31:27Z</dc:date>
    </item>
  </channel>
</rss>

