<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107699#M42889</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102882"&gt;@Abdurrahman&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;I am trying to add a column to an existing Delta Lake table and save the result as a new table. The Spark driver is getting overloaded. I have a Databricks notebook to work with (and decent compute as well: g5.12xlarge) and have tried coalesce, the SQL magic command, and writing to a new table with Spark in batches of 1 million or 10 million rows using zipWithIndex, but nothing has worked so far.&lt;/P&gt;&lt;P&gt;Need help here&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;To add a column to your Delta Lake table without overloading the Spark driver, try these solutions: use Delta Lake generated columns if the new column's value is derived from existing columns, optimize your Spark configuration for large-scale operations, experiment with different batch sizes for processing, and ensure there are no resource leaks in your code. Additionally, consult the Delta Lake documentation for best practices.&lt;/P&gt;</description>
    <pubDate>Thu, 30 Jan 2025 06:14:45 GMT</pubDate>
    <dc:creator>Joel742Bushong</dc:creator>
    <dc:date>2025-01-30T06:14:45Z</dc:date>
    <item>
      <title>How can I save a large spark table (~88.3Mn rows) to a delta lake table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107696#M42888</link>
      <description>&lt;P&gt;I am trying to add a column to an existing Delta Lake table and save the result as a new table. The Spark driver is getting overloaded. I have a Databricks notebook to work with (and decent compute as well: g5.12xlarge) and have tried coalesce, the SQL magic command, and writing to a new table with Spark in batches of 1 million or 10 million rows using zipWithIndex, but nothing has worked so far.&lt;/P&gt;&lt;P&gt;Need help here&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jan 2025 05:57:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107696#M42888</guid>
      <dc:creator>Abdurrahman</dc:creator>
      <dc:date>2025-01-30T05:57:34Z</dc:date>
    </item>
    <item>
      <title>Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107699#M42889</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102882"&gt;@Abdurrahman&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;I am trying to add a column to an existing Delta Lake table and save the result as a new table. The Spark driver is getting overloaded. I have a Databricks notebook to work with (and decent compute as well: g5.12xlarge) and have tried coalesce, the SQL magic command, and writing to a new table with Spark in batches of 1 million or 10 million rows using zipWithIndex, but nothing has worked so far.&lt;/P&gt;&lt;P&gt;Need help here&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;To add a column to your Delta Lake table without overloading the Spark driver, try these solutions: use Delta Lake generated columns if the new column's value is derived from existing columns, optimize your Spark configuration for large-scale operations, experiment with different batch sizes for processing, and ensure there are no resource leaks in your code. Additionally, consult the Delta Lake documentation for best practices.&lt;/P&gt;</description>
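      <content:encoded>&lt;P&gt;For the generated-column suggestion above, a minimal sketch might look like the following. All names here (events, events_v2, id, event_ts, event_date) are hypothetical placeholders, and the sketch assumes the new column can be derived from an existing timestamp column. As far as I know, generated columns are declared when the table is created, so this writes to a new table rather than altering the old one.&lt;/P&gt;

```sql
-- Hypothetical names: events (source), events_v2 (new table),
-- event_ts (existing column), event_date (derived column).
CREATE TABLE events_v2 (
  id BIGINT,
  event_ts TIMESTAMP,
  -- Delta computes this column on write from event_ts.
  event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
) USING DELTA;

-- The copy runs as a distributed job on the executors; no rows are
-- collected to the driver. event_date is omitted and should be
-- filled in by Delta from the generation expression.
INSERT INTO events_v2 (id, event_ts)
SELECT id, event_ts FROM events;
```

&lt;P&gt;Because the whole copy is a single INSERT ... SELECT, there should be no need for manual batching with zipWithIndex.&lt;/P&gt;</content:encoded>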
      <pubDate>Thu, 30 Jan 2025 06:14:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107699#M42889</guid>
      <dc:creator>Joel742Bushong</dc:creator>
      <dc:date>2025-01-30T06:14:45Z</dc:date>
    </item>
    <item>
      <title>Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107734#M42908</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102882"&gt;@Abdurrahman&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;To efficiently add a column to an existing Delta Lake table in a Databricks notebook when facing performance issues, consider the following strategies:&lt;/P&gt;
&lt;UL&gt;&lt;LI&gt;&lt;STRONG data-stringify-type="bold"&gt;Optimize the Table&lt;/STRONG&gt;: Before adding a new column, ensure that the table is optimized. Use the &lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;OPTIMIZE&lt;/CODE&gt; command to compact small files into larger ones, which can improve performance by reducing the number of files the system needs to manage.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG data-stringify-type="bold"&gt;Schema Evolution&lt;/STRONG&gt;: Delta Lake supports schema evolution, allowing you to add new columns without rewriting the entire dataset. Use the &lt;CODE class="c-mrkdwn__code" data-stringify-type="code"&gt;ALTER TABLE&lt;/CODE&gt; command to add a new column:&lt;/LI&gt;&lt;/UL&gt;
&lt;PRE class="c-mrkdwn__pre" data-stringify-type="pre"&gt;ALTER TABLE &amp;lt;table_name&amp;gt; ADD COLUMNS (new_column_name &amp;lt;data_type&amp;gt;);&lt;/PRE&gt;</description>
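      <content:encoded>&lt;P&gt;To make the ALTER TABLE route above concrete, here is a rough sketch of adding the column in place and then backfilling it. The table and column names (events, event_year, event_ts) are hypothetical, and the backfill expression is only an illustration; substitute whatever logic defines your new column.&lt;/P&gt;

```sql
-- Hypothetical names throughout: events, event_year, event_ts.
-- Step 1: metadata-only change. No data files are rewritten and
-- existing rows read NULL for the new column.
ALTER TABLE events ADD COLUMNS (event_year INT);

-- Step 2 (optional): backfill. This does rewrite data files, but as
-- a distributed job on the executors rather than on the driver.
UPDATE events SET event_year = year(event_ts);
```

&lt;P&gt;If NULL is an acceptable value for existing rows, step 1 alone is enough and the table never needs to be copied.&lt;/P&gt;</content:encoded>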
      <pubDate>Thu, 30 Jan 2025 08:33:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107734#M42908</guid>
      <dc:creator>Sidhant07</dc:creator>
      <dc:date>2025-01-30T08:33:32Z</dc:date>
    </item>
    <item>
      <title>Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107870#M42935</link>
      <description>&lt;P&gt;Hi &lt;A href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102882" target="_blank" rel="noopener"&gt;@Abdurrahman&lt;/A&gt;, in addition to what Sidhant07 said: I assume you will be using the new column in queries, so use both OPTIMIZE and ZORDER.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;ZORDER (Highly Recommended):&lt;/STRONG&gt; ZORDER colocates rows with similar values of the specified columns in the same files, so queries that filter on those columns can skip more files. If you frequently filter or query by certain columns (including the new column, once it is populated), ZORDER can improve query performance considerably. Note that it does not speed up adding the column itself: &lt;CODE&gt;ALTER TABLE ... ADD COLUMNS&lt;/CODE&gt; is a metadata-only change. Example:&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;OPTIMIZE your_delta_table ZORDER BY (column_used_in_filters, another_column);&lt;/PRE&gt;</description>
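      <content:encoded>&lt;P&gt;One way to check what OPTIMIZE with ZORDER actually did is to compare the table's file count before and after. This is only a sketch; your_delta_table and column_used_in_filters are the same placeholder names as in the example above.&lt;/P&gt;

```sql
-- Before: note numFiles and sizeInBytes in the output.
DESCRIBE DETAIL your_delta_table;

-- Compact small files and cluster rows by the filter column.
OPTIMIZE your_delta_table ZORDER BY (column_used_in_filters);

-- After: numFiles should drop if there were many small files.
DESCRIBE DETAIL your_delta_table;

-- The most recent operations, including the OPTIMIZE run.
DESCRIBE HISTORY your_delta_table LIMIT 5;
```

&lt;P&gt;If numFiles barely changes, the table was probably already compacted and the driver overload is more likely coming from somewhere else.&lt;/P&gt;</content:encoded>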
      <pubDate>Thu, 30 Jan 2025 16:39:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-i-save-a-large-spark-table-88-3mn-rows-to-a-delta-lake/m-p/107870#M42935</guid>
      <dc:creator>Amit_Dass</dc:creator>
      <dc:date>2025-01-30T16:39:01Z</dc:date>
    </item>
  </channel>
</rss>

