How can I save a large Spark table (~88.3M rows) to a Delta Lake table?

Abdurrahman
New Contributor II

I am trying to add a column to an existing Delta Lake table and save the result as a new table, but the Spark driver keeps getting overloaded. I am working in a Databricks notebook with decent compute (g5.12xlarge), and I have tried coalesce, the %sql magic command, and writing to a new table in batches of 1 million or 10 million rows using zipWithIndex, but nothing has worked so far.

Need help here
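
For reference, the straightforward version of what I'm attempting looks like this (simplified; table and column names are placeholders):

Python

from pyspark.sql import functions as F

# Read the existing Delta table (~88.3M rows) and add the new column.
df = spark.table("source_delta_table")
df = df.withColumn("new_col", F.lit(None).cast("string"))

# Save everything as a new Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("target_delta_table")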

3 REPLIES

Joel742Bushong
New Contributor II

Hello!

To add a column to your Delta Lake table without overloading the Spark driver, try these approaches:

• Use Delta Lake generated columns if the new column's value is derived from existing columns.
• Tune your Spark configuration for large-scale operations.
• Experiment with different batch sizes for processing.
• Make sure there are no resource leaks in your code.

Additionally, consult the Delta Lake documentation for best practices.
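
For example, if the new column can be computed from existing columns, a generated column lets Delta populate the value at write time instead of you materializing it with Spark. A minimal sketch using the Delta Lake Python API (the table and column names here are hypothetical):

Python

from delta.tables import DeltaTable

# Hypothetical table: event_year is derived from event_ts, so Delta
# computes it automatically whenever rows are written.
(DeltaTable.create(spark)
    .tableName("events_with_year")
    .addColumn("event_id", "BIGINT")
    .addColumn("event_ts", "TIMESTAMP")
    .addColumn("event_year", "INT", generatedAlwaysAs="YEAR(event_ts)")
    .execute())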

Sidhant07
Databricks Employee

Hi @Abdurrahman,

To efficiently add a column to an existing Delta Lake table in a Databricks notebook when facing performance issues, consider the following strategies:

• Optimize the Table: Before adding a new column, ensure that the table is optimized. Use the OPTIMIZE command to compact small files into larger ones, which improves performance by reducing the number of files the system has to manage.

• Schema Evolution: Delta Lake supports schema evolution, allowing you to add new columns without rewriting the entire dataset. Use the ALTER TABLE command to add a new column:

SQL

ALTER TABLE <table_name> ADD COLUMNS (new_column_name <data_type>);
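
If you are writing from a DataFrame rather than running DDL, schema evolution can also be enabled on the write itself. A sketch, assuming a DataFrame df that already contains the new column (the table name is a placeholder):

Python

# With mergeSchema, Delta adds the new column to the table schema
# instead of failing on the schema mismatch.
(df.write
    .format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .saveAsTable("target_delta_table"))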

Amit_Dass
New Contributor II

Hi @Abdurrahman, in addition to Sidhant07's suggestions: I assume that once you add this new column you will also be querying on it, so use ZORDER and OPTIMIZE together.

• ZORDER (Highly Recommended): Beyond running OPTIMIZE on its own, ZORDER sorts the data within the files based on the specified columns. If you frequently filter or query by certain columns (especially ones related to the new column you're adding, if applicable), ZORDER can dramatically improve query performance. Example:

SQL

OPTIMIZE your_delta_table ZORDER BY (column_used_in_filters, another_column);
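
Note that OPTIMIZE ZORDER rewrites the underlying data files, so it is best run after the new column has been added and populated, rather than before.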