cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

What gets overridden when writing overriding a delta lake table?

rami-lv
New Contributor II

I just tried to write to a delta lake table using override mode, and I found that history is reserved. It's unclear to me how the data is overridden, and how long the history could be preserved.

As they say, a code is better than a thousand words:

mytable.write.mode("overwrite").saveAsTable("apps")
 
%sql 
SELECT id,display_name FROM apps WHERE id = 3; 
id | name
3  | old 
 
# mytabl gets updated
mytable.write.mode("overwrite").saveAsTable("apps")
 
 
%sql 
SELECT id,display_name FROM apps WHERE id = 3; 
id | name
3  | new 
 
 
%sql 
SELECT id,display_name FROM apps VERSION as OF 0 WHERE id = 3; 
id | name
3  | old 
 
 
 

Also, what is the performance and cost implication of having many versions of a single table?

3 REPLIES 3

Geeta1
Valued Contributor

Hi @Rami ALZEBAK​ , you can use DESCRIBE HISTORY table_name command to get the history of the table. All the updates that you make to the table are versioned and stored in the form of parquet files. By default, table history is retained up to 30 days. And as far as I know, there is no cost implication associated with preserving history. Also the performance is also based on the complexity of your code but is not related to the versioning. Hope it helps!

LandanG
Databricks Employee
Databricks Employee

Hi @Rami ALZEBAK​,

As @Geeta mentioned, you can use the DESCRIBE HISTORY command to see the history of the table, but you can also look at the underlying data to see the .json files generated when committing changes to the data, This documentation is really good for understanding the history and versioning for Delta tables https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-table-specification.

In terms of performance and cost, maintaining many versions of a delta table can affect performance and cost (although fairly minimal) so that's why you can use the VACUUM command to remove old versions and reduce costs & latency. This blog is good for going into more detail: https://mungingdata.com/delta-lake/vacuum-command/.

Ajay-Pandey
Esteemed Contributor III

Hi @Rami ALZEBAK​ overwrite means first it will remove the data and again it will write the whole data.

If you want to see history use can use DESCRIBE HISTORY command

Ajay Kumar Pandey

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group