<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Merge Operation is very slow for S/4 Table ACDOCA in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/merge-operation-is-very-slow-for-s-4-table-acdoca/m-p/49555#M28581</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We have a scenario in Databricks where we receive 60-70 million records every day, and merging them into the 28 billion records already in the table takes a long time. Most of that time goes into rewriting the affected files. Merge time is not proportional to the number of records in the daily delta; it depends solely on the number of files the delta updates. The table is partitioned on Period, and each period holds around 800 million records. The delta records are spread across 3 years, so basically all 36 partitions, and sometimes they go back as far as 2020.&lt;/P&gt;&lt;P&gt;Please note this is a one-to-one copy of the source table with no logic at all.&lt;/P&gt;&lt;P&gt;We have tried all the Spark settings, optimizing the table, Z-ordering, and a large cluster with Photon (E16), but rewriting the updated files still takes a long time.&lt;/P&gt;&lt;P&gt;Can anyone suggest something, or has anyone done something similar and improved the performance?&lt;/P&gt;&lt;P&gt;Table size - 1.4 TB&lt;/P&gt;&lt;P&gt;Columns - 563&lt;/P&gt;&lt;P&gt;Partitioned by - Period&lt;/P&gt;&lt;P&gt;Time taken to merge and rewrite files - over 10 hours to update 3000 files, and the files are not that large.&lt;/P&gt;&lt;P&gt;Storage - Azure Blob Gen 2 in Parquet format&lt;/P&gt;&lt;P&gt;Type of table - Delta&lt;/P&gt;&lt;P&gt;It would be great if someone could help &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 20 Oct 2023 03:10:29 GMT</pubDate>
    <dc:creator>Kishan1003</dc:creator>
    <dc:date>2023-10-20T03:10:29Z</dc:date>
    <item>
      <title>Merge Operation is very slow for S/4 Table ACDOCA</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-is-very-slow-for-s-4-table-acdoca/m-p/49555#M28581</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We have a scenario in Databricks where we receive 60-70 million records every day, and merging them into the 28 billion records already in the table takes a long time. Most of that time goes into rewriting the affected files. Merge time is not proportional to the number of records in the daily delta; it depends solely on the number of files the delta updates. The table is partitioned on Period, and each period holds around 800 million records. The delta records are spread across 3 years, so basically all 36 partitions, and sometimes they go back as far as 2020.&lt;/P&gt;&lt;P&gt;Please note this is a one-to-one copy of the source table with no logic at all.&lt;/P&gt;&lt;P&gt;We have tried all the Spark settings, optimizing the table, Z-ordering, and a large cluster with Photon (E16), but rewriting the updated files still takes a long time.&lt;/P&gt;&lt;P&gt;Can anyone suggest something, or has anyone done something similar and improved the performance?&lt;/P&gt;&lt;P&gt;Table size - 1.4 TB&lt;/P&gt;&lt;P&gt;Columns - 563&lt;/P&gt;&lt;P&gt;Partitioned by - Period&lt;/P&gt;&lt;P&gt;Time taken to merge and rewrite files - over 10 hours to update 3000 files, and the files are not that large.&lt;/P&gt;&lt;P&gt;Storage - Azure Blob Gen 2 in Parquet format&lt;/P&gt;&lt;P&gt;Type of table - Delta&lt;/P&gt;&lt;P&gt;It would be great if someone could help &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
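      <content:encoded><![CDATA[
<P>Since the post ties merge time to the number of files rewritten, one commonly suggested mitigation is to name the touched partitions explicitly in the merge condition so that files in other partitions can be skipped. The following is only a minimal sketch of that idea, not the poster's actual job: the function name, the table names, the key columns, and the Period value format are all hypothetical, and it only builds the SQL text rather than running it. When nearly every partition is touched, as described above, the benefit may be small.</P>

```python
from typing import Iterable, Sequence


def build_pruned_merge_sql(
    target: str,
    source: str,
    key_columns: Sequence[str],
    partition_column: str,
    touched_partitions: Iterable[str],
) -> str:
    """Build a Delta MERGE statement whose ON clause lists the touched partitions.

    All names passed in are illustrative assumptions; nothing here is taken
    from the real ACDOCA pipeline.
    """
    partitions = sorted(set(touched_partitions))
    if not partitions:
        raise ValueError("no partitions to merge into")
    if not key_columns:
        raise ValueError("at least one key column is required")

    # Render each partition value as a SQL string literal, doubling single quotes.
    literals = ", ".join("'" + p.replace("'", "''") + "'" for p in partitions)
    # Equality match on every key column between target (t) and source (s).
    key_match = " AND ".join(f"t.{c} = s.{c}" for c in key_columns)

    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{partition_column} IN ({literals})\n"
        f"   AND t.{partition_column} = s.{partition_column}\n"
        f"   AND {key_match}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


if __name__ == "__main__":
    # Hypothetical inputs: in practice the partition list would come from the
    # distinct Period values present in the day's incoming batch.
    print(
        build_pruned_merge_sql(
            target="acdoca",
            source="acdoca_delta",
            key_columns=["BELNR", "DOCLN"],
            partition_column="Period",
            touched_partitions=["2023010", "2023009", "2023010"],
        )
    )
```

<P>The resulting string could then be passed to spark.sql(...) on a cluster. Whether it actually reduces the number of rewritten files depends on how the changed rows are distributed within each partition, so it is worth comparing the files-rewritten metric before and after.</P>
]]></content:encoded>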
      <pubDate>Fri, 20 Oct 2023 03:10:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-is-very-slow-for-s-4-table-acdoca/m-p/49555#M28581</guid>
      <dc:creator>Kishan1003</dc:creator>
      <dc:date>2023-10-20T03:10:29Z</dc:date>
    </item>
    <item>
      <title>Re: Merge Operation is very slow for S/4 Table ACDOCA</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-is-very-slow-for-s-4-table-acdoca/m-p/55278#M30289</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/91955"&gt;@Kishan1003&lt;/a&gt;, did you find anything helpful? I'm dealing with a similar situation: the ACDOCA table on my side is around 300M records (considerably smaller), and the incoming daily data is usually around 1M records. I have tried partitioning by period (the fiscyearper column), Z-ordering, and dynamic pruning. So far the best merge time has been around 1 hour. I want to understand whether I can achieve better performance before scaling up.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Dec 2023 23:17:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-is-very-slow-for-s-4-table-acdoca/m-p/55278#M30289</guid>
      <dc:creator>177991</dc:creator>
      <dc:date>2023-12-14T23:17:06Z</dc:date>
    </item>
  </channel>
</rss>

