topic Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table in Data Engineering

How can I save a large spark table (~88.3Mn rows) to a delta lake table

Abdurrahman — Thu, 30 Jan 2025 05:57:34 GMT

I am trying to add a column to an existing delta lake table by adding a column and saving the table as a new table. The spark driver is getting overloaded. I have databricks notebook to work with (I have a decent compute as well g5.12xlarge) and have tried coalesce, sql magic command, writing to a new table using spark in batches of 1 million or 10 million using zipwithindex but nothing seems to work so far.

Need help here

Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table

Joel742Bushong — Thu, 30 Jan 2025 06:14:45 GMT

@Abdurrahman wrote:
I am trying to add a column to an existing delta lake table by adding a column and saving the table as a new table. The spark driver is getting overloaded. I have databricks notebook to work with (I have a decent compute as well g5.12xlarge) and have tried coalesce, sql magic command, writing to a new table using spark in batches of 1 million or 10 million using zipwithindex but nothing seems to work so far.
Need help here

Hello!

To add a column to your Delta Lake table without overloading the Spark driver, try these solutions: use Delta Lake generated columns if the new column's value is derived from existing columns, optimize your Spark configuration for large-scale operations, experiment with different batch sizes for processing, and ensure there are no resource leaks in your code. Additionally, consult the Delta Lake documentation for best practices.

Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table

Sidhant07 — Thu, 30 Jan 2025 08:33:32 GMT

Hi @Abdurrahman ,

To efficiently add a column to an existing Delta Lake table in a Databricks notebook when facing performance issues, consider the following strategies:• Optimize the Table: Before adding a new column, ensure that the table is optimized. Use the OPTIMIZE command to compact small files into larger ones, which can improve performance by reducing the number of files the system needs to manage.• Schema Evolution: Delta Lake supports schema evolution, allowing you to add new columns without rewriting the entire dataset. Use the ALTER TABLE command to add a new column:

sql
  ALTER TABLE <table_name> ADD COLUMNS (new_column_name <data_type>);

Re: How can I save a large spark table (~88.3Mn rows) to a delta lake table

Amit_Dass — Thu, 30 Jan 2025 16:39:01 GMT

Hi @Abdurrahman, Addition to the Sidhant07, I assumed you are adding this new column and you may be using this column in query, Use the ZORDER & OPTIMIZE both.

ZORDER (Highly Recommended): Even more important than just OPTIMIZE for adding columns efficiently is using ZORDER. ZORDER sorts the data within the files based on the specified columns. If you frequently filter or query by certain columns (especially those related to the new column you're adding, if applicable), ZORDER dramatically improves query performance and makes metadata operations like adding columns much faster. Example:

SQL

OPTIMIZE your_delta_table ZORDER BY (column_used_in_filters, another_column);