cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

What gets overridden when writing overriding a delta lake table?

rami-lv
New Contributor II

I just tried to write to a delta lake table using override mode, and I found that history is reserved. It's unclear to me how the data is overridden, and how long the history could be preserved.

As they say, a code is better than a thousand words:

mytable.write.mode("overwrite").saveAsTable("apps")
 
%sql 
SELECT id,display_name FROM apps WHERE id = 3; 
id | name
3  | old 
 
# mytabl gets updated
mytable.write.mode("overwrite").saveAsTable("apps")
 
 
%sql 
SELECT id,display_name FROM apps WHERE id = 3; 
id | name
3  | new 
 
 
%sql 
SELECT id,display_name FROM apps VERSION as OF 0 WHERE id = 3; 
id | name
3  | old 
 
 
 

Also, what is the performance and cost implication of having many versions of a single table?

3 REPLIES 3

Geeta1
Valued Contributor

Hi @Rami ALZEBAK​ , you can use DESCRIBE HISTORY table_name command to get the history of the table. All the updates that you make to the table are versioned and stored in the form of parquet files. By default, table history is retained up to 30 days. And as far as I know, there is no cost implication associated with preserving history. Also the performance is also based on the complexity of your code but is not related to the versioning. Hope it helps!

LandanG
Honored Contributor
Honored Contributor

Hi @Rami ALZEBAK​,

As @Geeta mentioned, you can use the DESCRIBE HISTORY command to see the history of the table, but you can also look at the underlying data to see the .json files generated when committing changes to the data, This documentation is really good for understanding the history and versioning for Delta tables https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-table-specification.

In terms of performance and cost, maintaining many versions of a delta table can affect performance and cost (although fairly minimal) so that's why you can use the VACUUM command to remove old versions and reduce costs & latency. This blog is good for going into more detail: https://mungingdata.com/delta-lake/vacuum-command/.

Ajay-Pandey
Esteemed Contributor III

Hi @Rami ALZEBAK​ overwrite means first it will remove the data and again it will write the whole data.

If you want to see history use can use DESCRIBE HISTORY command

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.