How can I save a large Spark table (~88.3M rows) to a Delta Lake table?

Abdurrahman
New Contributor II

I am trying to add a column to an existing Delta Lake table and save the result as a new table, but the Spark driver keeps getting overloaded. I am working in a Databricks notebook with decent compute (g5.12xlarge), and I have tried coalesce, the %sql magic command, and writing to a new table in batches of 1 million or 10 million rows using zipWithIndex, but nothing has worked so far.

Need help here
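
For reference, the straightforward version of what I'm attempting looks like this (simplified; table and column names are placeholders):

Python

from pyspark.sql import functions as F

# Read the existing Delta table (~88.3M rows) and add the new column.
df = spark.table("source_delta_table")
df = df.withColumn("new_col", F.lit(None).cast("string"))

# Save everything as a new Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("target_delta_table")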

3 REPLIES

Joel742Bushong
New Contributor II

Hello!

To add a column to your Delta Lake table without overloading the Spark driver, try these approaches:

• Use Delta Lake generated columns if the new column's value is derived from existing columns.
• Tune your Spark configuration for large-scale operations.
• Experiment with different batch sizes for processing.
• Make sure there are no resource leaks in your code.

Additionally, consult the Delta Lake documentation for best practices.
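
For example, if the new column can be computed from existing columns, a generated column lets Delta populate the value at write time instead of you materializing it with Spark. A minimal sketch using the Delta Lake Python API (the table and column names here are hypothetical):

Python

from delta.tables import DeltaTable

# Hypothetical table: event_year is derived from event_ts, so Delta
# computes it automatically whenever rows are written.
(DeltaTable.create(spark)
    .tableName("events_with_year")
    .addColumn("event_id", "BIGINT")
    .addColumn("event_ts", "TIMESTAMP")
    .addColumn("event_year", "INT", generatedAlwaysAs="YEAR(event_ts)")
    .execute())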

Sidhant07
Databricks Employee

Hi @Abdurrahman,

To efficiently add a column to an existing Delta Lake table in a Databricks notebook when facing performance issues, consider the following strategies:

• Optimize the Table: Before adding a new column, ensure that the table is optimized. Use the OPTIMIZE command to compact small files into larger ones, which improves performance by reducing the number of files the system has to manage.

• Schema Evolution: Delta Lake supports schema evolution, allowing you to add new columns without rewriting the entire dataset. Use the ALTER TABLE command to add a new column:

SQL

ALTER TABLE <table_name> ADD COLUMNS (new_column_name <data_type>);
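
If you are writing from a DataFrame rather than running DDL, schema evolution can also be enabled on the write itself. A sketch, assuming a DataFrame df that already contains the new column (the table name is a placeholder):

Python

# With mergeSchema, Delta adds the new column to the table schema
# instead of failing on the schema mismatch.
(df.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .saveAsTable("target_delta_table"))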

Amit_Dass
New Contributor II

Hi @Abdurrahman, in addition to Sidhant07's suggestions: I assume that once you add this new column you will also be querying on it, so use ZORDER and OPTIMIZE together.

• ZORDER (Highly Recommended): Beyond running OPTIMIZE on its own, ZORDER sorts the data within the files based on the specified columns. If you frequently filter or query by certain columns (especially ones related to the new column you're adding, if applicable), ZORDER can dramatically improve query performance. Example:

SQL

OPTIMIZE your_delta_table ZORDER BY (column_used_in_filters, another_column);
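
Note that OPTIMIZE ZORDER rewrites the underlying data files, so it is best run after the new column has been added and populated, rather than before.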