<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: MERGE operation on PI data getting slower. How can I debug? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/merge-operation-on-pi-data-getting-slower-how-can-i-debug/m-p/24824#M17267</link>
    <description>&lt;P&gt;Delta Lake &lt;A href="https://databricks.com/blog/2020/09/29/diving-into-delta-lake-dml-internals-update-delete-merge.html" alt="https://databricks.com/blog/2020/09/29/diving-into-delta-lake-dml-internals-update-delete-merge.html" target="_blank"&gt;completes a MERGE in two steps&lt;/A&gt;:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Perform an &lt;B&gt;inner join&lt;/B&gt; between the target table and the source table to select all files that have matches.&lt;/LI&gt;&lt;LI&gt;Perform an &lt;B&gt;outer join&lt;/B&gt; between the selected files in the target table and the source table, and write out the updated/deleted/inserted data.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;If finding the files that Delta Lake needs to rewrite is taking too long, try:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Add more predicates to narrow down the search space.&lt;/LI&gt;&lt;LI&gt;Adjust shuffle partitions.&lt;/LI&gt;&lt;LI&gt;Adjust broadcast join thresholds.&lt;/LI&gt;&lt;LI&gt;Right-size the files (balance too many small files against too few large files).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If rewriting the files themselves is taking too long, try:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Adjust shuffle partitions / adaptive query execution (AQE).&lt;/LI&gt;&lt;LI&gt;Enable &lt;A href="https://docs.databricks.com/delta/optimizations/auto-optimize.html" alt="https://docs.databricks.com/delta/optimizations/auto-optimize.html" target="_blank"&gt;optimized writes&lt;/A&gt;.&lt;/LI&gt;&lt;LI&gt;Adjust broadcast thresholds.&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Mon, 21 Jun 2021 22:08:55 GMT</pubDate>
    <dc:creator>sajith_appukutt</dc:creator>
    <dc:date>2021-06-21T22:08:55Z</dc:date>
    <item>
      <title>MERGE operation on PI data getting slower. How can I debug?</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-on-pi-data-getting-slower-how-can-i-debug/m-p/24823#M17266</link>
      <description>&lt;P&gt;We have a Structured Streaming job that reads from Event Hubs and persists to the Delta raw/bronze layer via MERGE inside a &lt;B&gt;foreachBatch&lt;/B&gt;. Lately, however, the merge has been taking longer and longer. How can I optimize this pipeline?&lt;/P&gt;</description>
      <pubDate>Sun, 13 Jun 2021 23:55:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-on-pi-data-getting-slower-how-can-i-debug/m-p/24823#M17266</guid>
      <dc:creator>sajith_appukutt</dc:creator>
      <dc:date>2021-06-13T23:55:00Z</dc:date>
    </item>
    <item>
      <title>Re: MERGE operation on PI data getting slower. How can I debug?</title>
      <link>https://community.databricks.com/t5/data-engineering/merge-operation-on-pi-data-getting-slower-how-can-i-debug/m-p/24824#M17267</link>
      <description>&lt;P&gt;Delta Lake &lt;A href="https://databricks.com/blog/2020/09/29/diving-into-delta-lake-dml-internals-update-delete-merge.html" alt="https://databricks.com/blog/2020/09/29/diving-into-delta-lake-dml-internals-update-delete-merge.html" target="_blank"&gt;completes a MERGE in two steps&lt;/A&gt;:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Perform an &lt;B&gt;inner join&lt;/B&gt; between the target table and the source table to select all files that have matches.&lt;/LI&gt;&lt;LI&gt;Perform an &lt;B&gt;outer join&lt;/B&gt; between the selected files in the target table and the source table, and write out the updated/deleted/inserted data.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;If finding the files that Delta Lake needs to rewrite is taking too long, try:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Add more predicates to narrow down the search space.&lt;/LI&gt;&lt;LI&gt;Adjust shuffle partitions.&lt;/LI&gt;&lt;LI&gt;Adjust broadcast join thresholds.&lt;/LI&gt;&lt;LI&gt;Right-size the files (balance too many small files against too few large files).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If rewriting the files themselves is taking too long, try:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Adjust shuffle partitions / adaptive query execution (AQE).&lt;/LI&gt;&lt;LI&gt;Enable &lt;A href="https://docs.databricks.com/delta/optimizations/auto-optimize.html" alt="https://docs.databricks.com/delta/optimizations/auto-optimize.html" target="_blank"&gt;optimized writes&lt;/A&gt;.&lt;/LI&gt;&lt;LI&gt;Adjust broadcast thresholds.&lt;/LI&gt;&lt;/UL&gt;</description>
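      <content:encoded><![CDATA[<P>The "add more predicates" suggestion above could look roughly like the sketch below. It is not taken from the original answer: the table name <B>bronze.pi_events</B>, the partition column <B>event_date</B> (assumed to be a DATE column), and the key columns <B>tag_id</B> and <B>event_ts</B> are all hypothetical. The idea is to restrict the MERGE condition to the partitions that actually occur in the micro-batch, so the inner-join step may be able to skip files in other partitions.</P>

```python
from datetime import date
from typing import Iterable


def build_merge_condition(partition_col: str,
                          partition_values: Iterable[date],
                          key_cols: Iterable[str]) -> str:
    """Build a MERGE condition that first limits the target (alias t) to the
    partitions present in the batch, then matches source (alias s) on the keys."""
    values = sorted(set(partition_values))
    keys = list(key_cols)
    if not values or not keys:
        raise ValueError("need at least one partition value and one key column")
    in_list = ", ".join(f"DATE '{v.isoformat()}'" for v in values)
    key_match = " AND ".join(f"t.{c} = s.{c}" for c in keys)
    return f"t.{partition_col} IN ({in_list}) AND {key_match}"


def upsert_batch(micro_batch_df, batch_id):
    """foreachBatch handler (hypothetical names; needs pyspark and delta-spark)."""
    # Imported lazily so the helper above can be used without Spark installed.
    from delta.tables import DeltaTable

    # Assumption: event_date is a DateType column, so collect() yields datetime.date.
    dates = [row[0] for row in
             micro_batch_df.select("event_date").distinct().collect()]
    if not dates:
        return  # empty micro-batch, nothing to merge

    condition = build_merge_condition("event_date", dates, ["tag_id", "event_ts"])
    # DataFrame.sparkSession is assumed to be available (PySpark 3.3 or later).
    target = DeltaTable.forName(micro_batch_df.sparkSession, "bronze.pi_events")
    (target.alias("t")
           .merge(micro_batch_df.alias("s"), condition)
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
```

<P>The handler would be attached with <B>writeStream.foreachBatch(upsert_batch)</B>. Whether the extra predicate helps depends on how the target table is partitioned and on how many distinct dates each batch touches, so it is worth comparing MERGE durations before and after.</P>]]></content:encoded>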
      <pubDate>Mon, 21 Jun 2021 22:08:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/merge-operation-on-pi-data-getting-slower-how-can-i-debug/m-p/24824#M17267</guid>
      <dc:creator>sajith_appukutt</dc:creator>
      <dc:date>2021-06-21T22:08:55Z</dc:date>
    </item>
  </channel>
</rss>

