<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to run OPTIMIZE on a very large dataset of 11TB or more? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80930#M36161</link>
    <description>&lt;P&gt;Mr. jacovangelder, thank you for your reply.&lt;/P&gt;&lt;P&gt;And sorry for the inconvenience caused by my description of the VMs in AWS.&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is no doubt that the r6g.large cluster I used is insufficient in terms of compute and memory capacity.&lt;/P&gt;&lt;P&gt;When I looked into it, as you said, OPTIMIZE seems to place a heavy load on CPU and memory because it calculates column statistics for data skipping.&lt;/P&gt;&lt;P&gt;I would like to somehow convince my boss to use r6g.4xlarge x 1~6 or r6g.8xlarge x 1~3.&lt;/P&gt;&lt;P&gt;Your answer was very helpful. Thank you. May good things come to you for your kindness.&lt;/P&gt;</description>
    <pubDate>Mon, 29 Jul 2024 01:57:03 GMT</pubDate>
    <dc:creator>Takao</dc:creator>
    <dc:date>2024-07-29T01:57:03Z</dc:date>
    <item>
      <title>How to run OPTIMIZE on a very large dataset of 11TB or more?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80744#M36146</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Sorry for my poor English and limited Databricks skills.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;At work, my boss asked me to apply liquid clustering on four columns to an 11TB Delta Lake table with over 80 columns, and I estimated the resources and costs required to do it.&lt;/P&gt;&lt;P&gt;When I reported the estimate to my boss, he said the cost was too high and had me run the process on a cluster with the following configuration.&lt;/P&gt;&lt;P&gt;・Cluster configuration&lt;BR /&gt;- Driver ... r6g.large x 1,&lt;BR /&gt;- Worker ... r6g.large x min 2 to max 10 (auto-scaling)&lt;/P&gt;&lt;P&gt;As expected, the dataset is so large that OPTIMIZE has not finished after more than five days.&lt;/P&gt;&lt;P&gt;Looking at the Spark UI, the job is making no progress at all, while the number of remaining tasks keeps growing and the spill has rapidly increased to over 60TB.&lt;/P&gt;&lt;P&gt;OPTIMIZE is supposed to leave checkpoints, so I'm thinking of convincing my boss to cancel it now.&lt;/P&gt;&lt;P&gt;In that case, what kind of cluster configuration would be desirable for running it again?&lt;/P&gt;</description>
      <pubDate>Fri, 26 Jul 2024 16:13:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80744#M36146</guid>
      <dc:creator>Takao</dc:creator>
      <dc:date>2024-07-26T16:13:23Z</dc:date>
    </item>
    <item>
      <title>Re: How to run OPTIMIZE on a very large dataset of 11TB or more?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80795#M36153</link>
      <description>&lt;P&gt;&lt;SPAN&gt;A couple of things:&lt;BR /&gt;OPTIMIZE is a very compute-intensive operation. Make sure you pick a VM that is compute optimized.&lt;BR /&gt;I had to look into the AWS instances, but it seems the&amp;nbsp;r6g.large you're using is just a 2 vCPU, 16GB machine. This is far from sufficient to optimize an 11TB table. The spill you're getting is the result of this. I would lower your number of workers but scale up the VMs vertically, for example to an&amp;nbsp;r6g.4xlarge with 1-6 workers or an r6g.8xlarge with 1-3 workers.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;And last but not least, set&amp;nbsp;&lt;STRONG&gt;delta.targetFileSize&amp;nbsp;&lt;/STRONG&gt;to 1GB. This is the recommended size for tables of ~10TB.&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 27 Jul 2024 05:57:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80795#M36153</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-07-27T05:57:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to run OPTIMIZE on a very large dataset of 11TB or more?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80930#M36161</link>
      <description>&lt;P&gt;Mr. jacovangelder, thank you for your reply.&lt;/P&gt;&lt;P&gt;And sorry for the inconvenience caused by my description of the VMs in AWS.&amp;nbsp;&lt;/P&gt;&lt;P&gt;There is no doubt that the r6g.large cluster I used is insufficient in terms of compute and memory capacity.&lt;/P&gt;&lt;P&gt;When I looked into it, as you said, OPTIMIZE seems to place a heavy load on CPU and memory because it calculates column statistics for data skipping.&lt;/P&gt;&lt;P&gt;I would like to somehow convince my boss to use r6g.4xlarge x 1~6 or r6g.8xlarge x 1~3.&lt;/P&gt;&lt;P&gt;Your answer was very helpful. Thank you. May good things come to you for your kindness.&lt;/P&gt;</description>
      <pubDate>Mon, 29 Jul 2024 01:57:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-run-optimize-to-too-big-data-set-which-has-11tb-and-more/m-p/80930#M36161</guid>
      <dc:creator>Takao</dc:creator>
      <dc:date>2024-07-29T01:57:03Z</dc:date>
    </item>
  </channel>
</rss>

