How can I save a large spark table (~88.3Mn rows) to a delta lake table
3 weeks ago
I am trying to add a column to an existing Delta Lake table by deriving the new column and saving the result as a new table, but the Spark driver keeps getting overloaded. I am working in a Databricks notebook on decent compute (g5.12xlarge) and have tried coalesce, the %sql magic command, and writing to a new table with Spark in batches of 1 million or 10 million rows using zipWithIndex, but nothing has worked so far.
Need help here.
Labels:
- Delta Lake
- Spark
3 weeks ago
@Abdurrahman wrote: I am trying to add a column to an existing Delta Lake table and save it as a new table, but the Spark driver is getting overloaded. [...]
Hello!
To add a column to your Delta Lake table without overloading the Spark driver, try the following:
- Use Delta Lake generated columns if the new column's value can be derived from existing columns (see the sketch below).
- Tune your Spark configuration for large-scale operations.
- Experiment with different batch sizes for processing.
- Make sure there are no resource leaks in your code.
Additionally, consult the Delta Lake documentation for best practices.
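A minimal sketch of the generated-column approach, assuming a hypothetical events table with an event_ts column (all names here are illustrative, not from the original post):
```sql
-- Hypothetical target table: event_date is computed automatically by
-- Delta on write, so you never build the column yourself.
CREATE TABLE events_with_date (
  id BIGINT,
  event_ts TIMESTAMP,
  event_date DATE GENERATED ALWAYS AS (CAST(event_ts AS DATE))
)
USING DELTA;

-- Backfill from the existing table; the generated column is filled
-- automatically for every inserted row.
INSERT INTO events_with_date (id, event_ts)
SELECT id, event_ts FROM events;
```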
3 weeks ago
Hi @Abdurrahman,
To efficiently add a column to an existing Delta Lake table in a Databricks notebook when facing performance issues, consider the following strategies:
- Optimize the table: Before adding a new column, run the OPTIMIZE command to compact small files into larger ones, which improves performance by reducing the number of files the system needs to manage.
- Schema evolution: Delta Lake supports schema evolution, allowing you to add new columns without rewriting the entire dataset. Use the ALTER TABLE command to add a new column:
```sql
ALTER TABLE <table_name> ADD COLUMNS (new_column_name <data_type>);
```
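Putting both steps together, a minimal sketch (the table name, column name, and backfill expression below are assumptions for illustration):
```sql
-- Compact small files first so later operations touch fewer files.
OPTIMIZE events;

-- Metadata-only change: no data files are rewritten.
ALTER TABLE events ADD COLUMNS (amount_usd DOUBLE);

-- Optional backfill; this rewrites the affected data files and is the
-- expensive step, not the ALTER TABLE itself.
UPDATE events SET amount_usd = amount * 1.08;
```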
3 weeks ago
Hi @Abdurrahman, adding to Sidhant07's reply: since you may be querying the new column, use ZORDER together with OPTIMIZE.
ZORDER (highly recommended): ZORDER sorts the data within the files by the specified columns. If you frequently filter or query by certain columns (especially ones related to the new column you're adding), ZORDER enables data skipping and can dramatically improve query performance. Example:
```sql
OPTIMIZE your_delta_table ZORDER BY (column_used_in_filters, another_column);
```

