<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to write trillions of rows to unity catalog table. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119539#M45904</link>
    <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/103581"&gt;@Nagarathna&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116514"&gt;@Lucas_TBrabo&lt;/a&gt;&amp;nbsp;&amp;nbsp;I’d like to share my opinion and some tips that might help:&lt;BR /&gt;&lt;BR /&gt;1. You should try to avoud filtering by spark_partition_id because&amp;nbsp; you can create skewed partitions, you should use with&lt;STRONG&gt; repartition() and spark can optimize the execution plan.&lt;BR /&gt;&lt;BR /&gt;2. &lt;/STRONG&gt;You are using &lt;STRONG&gt;overwrite mode&lt;/STRONG&gt; That means you’re &lt;SPAN class=""&gt;deleting the entire table and rewriting it on every batch. This is not only inefficient, it also &lt;SPAN class=""&gt;negates the benefit of chunking, and will keep triggering large-scale execution plans repeatedly.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;If your goal is to &lt;SPAN class=""&gt;&lt;STRONG&gt;progressively load the entire dataset into a Unity Catalog table, you should use &lt;STRONG&gt;append&lt;BR /&gt;&lt;BR /&gt;3. Your current setup:&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;4 × &lt;SPAN class=""&gt;i3.4xlarge = ~64 cores total, but low parallelism&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;Suggestion:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;Enable &lt;SPAN class=""&gt;&lt;STRONG&gt;autoscaling, e.g. &lt;SPAN class=""&gt;min=8, &lt;SPAN class=""&gt;max=32 using&lt;SPAN class=""&gt;&amp;nbsp;&lt;STRONG&gt;smaller instances&lt;SPAN class=""&gt; like &lt;SPAN class=""&gt;i3.large&lt;SPAN class=""&gt; or &lt;SPAN class=""&gt;i3.xlarge&lt;SPAN class=""&gt; for &lt;STRONG&gt;higher parallelism&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;This improves resiliency and reduces the chance of single-point executor crashes&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I haven’t tried this myself, but these are some ideas that came to mind. If you give it a shot, feel free to share the results so others can benefit too.&lt;BR /&gt;&lt;BR /&gt;Hope this helps, &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
    <pubDate>Sat, 17 May 2025 20:01:34 GMT</pubDate>
    <dc:creator>Isi</dc:creator>
    <dc:date>2025-05-17T20:01:34Z</dc:date>
    <item>
      <title>How to write trillions of rows to unity catalog table.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119153#M45807</link>
      <description>&lt;P&gt;Hi team,&lt;/P&gt;&lt;P&gt;I have a dataframe with&amp;nbsp;&lt;SPAN&gt;1269408570800 rows . I need to write this data to unity catalog table.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;How can I upload huge quantity of data ?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I'm using databricks i runtime 15.4 LTS with 4 workers and each worker type is&amp;nbsp;i3.4xlarge and driver of type&amp;nbsp;i3.4xlarge. I've tried to upload this data using chunksize but after 2 hours, it will result in an error.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Can anyone guide me as how can I upload such a huge data ? &lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 14 May 2025 10:21:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119153#M45807</guid>
      <dc:creator>Nagarathna</dc:creator>
      <dc:date>2025-05-14T10:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to write trillions of rows to unity catalog table.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119172#M45810</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/103581"&gt;@Nagarathna&lt;/a&gt;,&amp;nbsp;a couple of questions here...&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;What error is the job returning after the 2 hours execution?&lt;/LI&gt;
&lt;LI&gt;What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of csv/json/excel files?&lt;/LI&gt;
&lt;LI&gt;How are you chunking the data? (if you could give a code example...)&lt;/LI&gt;
&lt;LI&gt;Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?&lt;/LI&gt;
&lt;LI&gt;Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?&lt;/LI&gt;
&lt;LI&gt;The cluster is fix sized with 4 workers or auto-scale is enabled?&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 14 May 2025 12:20:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119172#M45810</guid>
      <dc:creator>Lucas_TBrabo</dc:creator>
      <dc:date>2025-05-14T12:20:30Z</dc:date>
    </item>
    <item>
      <title>Re: How to write trillions of rows to unity catalog table.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119275#M45826</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;Below is my answer to all the queries.&lt;/P&gt;&lt;P&gt;1. What error is the job returning after the 2 hours execution?&lt;BR /&gt;-&amp;gt; &lt;EM&gt;Below is the screenshot of error&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Nagarathna_0-1747288248186.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/16859i56A33E26352E9B8C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Nagarathna_0-1747288248186.png" alt="Nagarathna_0-1747288248186.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;2.&amp;nbsp;What is the data source we are talking about? Are you trying to load data from a database, another warehouse, a bunch of csv/json/excel files?&lt;EM&gt;&lt;BR /&gt;-&amp;gt; Currently we are using pyspark and python code to create the data. So source is dataframe&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;EM&gt;3.&amp;nbsp;&lt;/EM&gt;&lt;/SPAN&gt;How are you chunking the data? (if you could give a code example...)&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;-&amp;gt; &lt;EM&gt;df dataframe contains data.&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;total_rows = 1269408570800
batch_size = 1000000
num_batches = (total_rows // batch_size) + (1 if total_rows % batch_size != 0 else 0)
print(num_batches)
 
for i in range(num_batches):
    batch_df = df.filter(f"spark_partition_id() == {i}")
    batch_df.write.mode("overwrite").saveAsTable(table_path)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;EM&gt;I've tried using sql to upload. Data will be loaded to table if I provide limit as 1,00,000. But If I increase it then this is also failing with same executor error. It will take time to upload if everytime we are only uploading 1,00,000 rows.&lt;/EM&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;INSERT INTO dspf_test.bulksaleshourlydataperformance_1845.t_hourly_sales_summary_result select * from temp_view limit 100000000. &lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;4.&amp;nbsp;Are there any resource bottlenecks visible in the Spark UI (e.g., memory, CPU, disk, network)?&lt;/P&gt;&lt;P&gt;-&amp;gt; &lt;EM&gt;I could see that the CPU utilisation for all the 4 executors have reached 80% .&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;5. Do you see any relevant warnings or errors in the Spark executor/driver logs before the failure?&lt;/P&gt;&lt;P&gt;-&amp;gt; Yes . &lt;EM&gt;ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 146027 ms&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;6.&amp;nbsp;&lt;/EM&gt;The cluster is fix sized with 4 workers or auto-scale is enabled?&lt;/P&gt;&lt;P&gt;-&amp;gt; &lt;EM&gt;autoscale is not enabled&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 15 May 2025 05:53:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119275#M45826</guid>
      <dc:creator>Nagarathna</dc:creator>
      <dc:date>2025-05-15T05:53:12Z</dc:date>
    </item>
    <item>
      <title>Re: How to write trillions of rows to unity catalog table.</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119539#M45904</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/103581"&gt;@Nagarathna&lt;/a&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/116514"&gt;@Lucas_TBrabo&lt;/a&gt;&amp;nbsp;&amp;nbsp;I’d like to share my opinion and some tips that might help:&lt;BR /&gt;&lt;BR /&gt;1. You should try to avoud filtering by spark_partition_id because&amp;nbsp; you can create skewed partitions, you should use with&lt;STRONG&gt; repartition() and spark can optimize the execution plan.&lt;BR /&gt;&lt;BR /&gt;2. &lt;/STRONG&gt;You are using &lt;STRONG&gt;overwrite mode&lt;/STRONG&gt; That means you’re &lt;SPAN class=""&gt;deleting the entire table and rewriting it on every batch. This is not only inefficient, it also &lt;SPAN class=""&gt;negates the benefit of chunking, and will keep triggering large-scale execution plans repeatedly.&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;If your goal is to &lt;SPAN class=""&gt;&lt;STRONG&gt;progressively load the entire dataset into a Unity Catalog table, you should use &lt;STRONG&gt;append&lt;BR /&gt;&lt;BR /&gt;3. Your current setup:&lt;/STRONG&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;4 × &lt;SPAN class=""&gt;i3.4xlarge = ~64 cores total, but low parallelism&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;Suggestion:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P class=""&gt;Enable &lt;SPAN class=""&gt;&lt;STRONG&gt;autoscaling, e.g. &lt;SPAN class=""&gt;min=8, &lt;SPAN class=""&gt;max=32 using&lt;SPAN class=""&gt;&amp;nbsp;&lt;STRONG&gt;smaller instances&lt;SPAN class=""&gt; like &lt;SPAN class=""&gt;i3.large&lt;SPAN class=""&gt; or &lt;SPAN class=""&gt;i3.xlarge&lt;SPAN class=""&gt; for &lt;STRONG&gt;higher parallelism&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P class=""&gt;This improves resiliency and reduces the chance of single-point executor crashes&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I haven’t tried this myself, but these are some ideas that came to mind. If you give it a shot, feel free to share the results so others can benefit too.&lt;BR /&gt;&lt;BR /&gt;Hope this helps, &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Sat, 17 May 2025 20:01:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-write-trillions-of-rows-to-unity-catalog-table/m-p/119539#M45904</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-05-17T20:01:34Z</dc:date>
    </item>
  </channel>
</rss>

