<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic OPTIMIZE command failed to complete on partitioned dataset in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimize-command-failed-to-complete-on-partitioned-dataset/m-p/16353#M10547</link>
    <description>&lt;P&gt;Trying to optimize a Delta table with the following stats:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;size: 212,848 blobs, 31,162,417,246,985 bytes&lt;/LI&gt;&lt;LI&gt;command: OPTIMIZE &amp;lt;table&amp;gt; ZORDER BY (X, Y, Z)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In the Spark UI I can see all the work divided into batches, and each batch starts with 400 tasks to collect data. But each batch's processing stage fails after collecting the data. Error example:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;B&gt;Description&lt;/B&gt;: (Batch 11 [ Processing Files ((35651 - 39203) / 213211) ]) Optimizing 3553 files in abfss://&amp;lt;table&amp;gt;&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Tasks: &lt;/B&gt;1510/3200&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Failure Reason: &lt;/B&gt;Job aborted due to stage failure: Total size of serialized results of 1511 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The command runs with the default configuration.&lt;/P&gt;&lt;P&gt;The question is: why does the OPTIMIZE process choose batches that exceed the spark.driver.maxResultSize limit? How can we configure the splitting to create smaller batches?&lt;/P&gt;</description>
    <pubDate>Fri, 16 Dec 2022 15:25:35 GMT</pubDate>
    <dc:creator>MaximS</dc:creator>
    <dc:date>2022-12-16T15:25:35Z</dc:date>
    <item>
      <title>OPTIMIZE command failed to complete on partitioned dataset</title>
      <link>https://community.databricks.com/t5/data-engineering/optimize-command-failed-to-complete-on-partitioned-dataset/m-p/16353#M10547</link>
      <description>&lt;P&gt;Trying to optimize a Delta table with the following stats:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;size: 212,848 blobs, 31,162,417,246,985 bytes&lt;/LI&gt;&lt;LI&gt;command: OPTIMIZE &amp;lt;table&amp;gt; ZORDER BY (X, Y, Z)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;In the Spark UI I can see all the work divided into batches, and each batch starts with 400 tasks to collect data. But each batch's processing stage fails after collecting the data. Error example:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;B&gt;Description&lt;/B&gt;: (Batch 11 [ Processing Files ((35651 - 39203) / 213211) ]) Optimizing 3553 files in abfss://&amp;lt;table&amp;gt;&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Tasks: &lt;/B&gt;1510/3200&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Failure Reason: &lt;/B&gt;Job aborted due to stage failure: Total size of serialized results of 1511 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The command runs with the default configuration.&lt;/P&gt;&lt;P&gt;The question is: why does the OPTIMIZE process choose batches that exceed the spark.driver.maxResultSize limit? How can we configure the splitting to create smaller batches?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Dec 2022 15:25:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimize-command-failed-to-complete-on-partitioned-dataset/m-p/16353#M10547</guid>
      <dc:creator>MaximS</dc:creator>
      <dc:date>2022-12-16T15:25:35Z</dc:date>
    </item>
    <item>
      <title>Re: OPTIMIZE command failed to complete on partitioned dataset</title>
      <link>https://community.databricks.com/t5/data-engineering/optimize-command-failed-to-complete-on-partitioned-dataset/m-p/16354#M10548</link>
      <description>&lt;P&gt;Can you share a sample dataset for this, so that we can debug and help you accordingly?&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Aviral&lt;/P&gt;</description>
      <pubDate>Sun, 18 Dec 2022 06:49:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimize-command-failed-to-complete-on-partitioned-dataset/m-p/16354#M10548</guid>
      <dc:creator>Aviral-Bhardwaj</dc:creator>
      <dc:date>2022-12-18T06:49:34Z</dc:date>
    </item>
  </channel>
</rss>