
Unity Catalog - external table lastUpdateVersion

rudyevers
New Contributor III

We are currently upgrading our Lakehouse to use the Unity Catalog benefits. We will mostly use external tables because all our Delta tables are already stored in Azure Storage. I am trying to figure out how to update the table property "delta.lastUpdateVersion". Since table schemas can change over time, you want the external tables to reflect the latest schema version. Looking at the documentation, nothing is mentioned about this table property as far as I have seen. The data in the table did change when we updated the Delta table, but it seems you have to fully recreate the external table if the schema has changed.

What is the best practice in this case? Recreate the external table? Or did I overlook something?
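
(For reference, the property in question can be inspected with standard Spark SQL; the three-level table name below is just a placeholder.)

    -- Show the Delta-related properties recorded for the table,
    -- including delta.lastUpdateVersion (table name is a placeholder)
    SHOW TBLPROPERTIES my_catalog.my_schema.my_external_table;

    -- Or inspect the full table metadata
    DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_external_table;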

 


6 REPLIES

Kaniz_Fatma
Community Manager

Hi @rudyevers, When dealing with Delta Lake tables and schema updates, there are a few best practices to consider:

 

Schema Updates:

  • Delta Lake allows you to update the schema of a table. The following types of changes are supported:
    • Adding new columns (at arbitrary positions)
    • Reordering existing columns
    • Renaming existing columns
  • You can make these changes explicitly using DDL (Data Definition Language) or implicitly using DML (Data Manipulation Language).
  • If you want to add columns or modify the schema, you don’t necessarily need to recreate the entire table. Instead, you can explicitly update the schema using DDL statements.
  • For example (a concrete sketch follows this list):
    • To add columns, use: ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)
    • To change column comment or ordering, use: ALTER TABLE table_name ALTER [COLUMN] col_name (COMMENT col_comment | FIRST | AFTER colA_name)
    • To replace columns, use: ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)
    • To rename columns, use: ALTER TABLE table_name RENAME COLUMN old_col_name TO new_col_name
  • Note that adding nested columns is supported only for structs; arrays and maps are not supported.
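
A minimal, runnable sketch of the ALTER TABLE statements above (table and column names are placeholders, not from the original thread):

    -- Add a new column after an existing one
    ALTER TABLE my_schema.events
      ADD COLUMNS (country STRING COMMENT 'ISO country code' AFTER user_id);

    -- Change a column comment, then move the column to the front
    ALTER TABLE my_schema.events ALTER COLUMN country COMMENT 'ISO 3166-1 alpha-2 code';
    ALTER TABLE my_schema.events ALTER COLUMN country FIRST;

    -- Rename a column (requires column mapping, see "Column Mapping for Renaming" below)
    ALTER TABLE my_schema.events RENAME COLUMN country TO country_code;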

Stream Termination:

  • When you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue, you must restart it.
  • Consider this impact when planning schema updates.

Replacing the Whole Table:

  • If you need to change the entire schema, you can replace the whole table atomically.
  • This involves creating a new table with the updated schema and then swapping it with the existing table.
  • This approach ensures consistency and avoids partial updates.
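
A hedged sketch of such an atomic replacement using CREATE OR REPLACE TABLE (all names are placeholders; adapt the schema and source to your own tables):

    -- Replace the table atomically: the new schema and data are committed in a
    -- single transaction, so readers never see a partially updated table.
    CREATE OR REPLACE TABLE my_schema.events
    USING DELTA
    AS
    SELECT
      event_id,
      user_id,
      CAST(event_time AS TIMESTAMP) AS event_time,  -- changed data type
      CAST(NULL AS STRING)          AS payload      -- newly added column
    FROM my_schema.events_source;                   -- placeholder source table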

Column Mapping for Renaming:

  • Renaming (or dropping) columns without rewriting the underlying data files requires column mapping to be enabled on the table (the delta.columnMapping.mode = 'name' table property).
  • Enabling column mapping upgrades the table protocol (reader version 2, writer version 5), so older Delta readers and writers may no longer be able to access the table.
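
A minimal sketch of enabling column mapping and then renaming a column (table and column names are placeholders):

    -- Enable column mapping by name; this also upgrades the table protocol
    ALTER TABLE my_schema.events SET TBLPROPERTIES (
      'delta.columnMapping.mode' = 'name',
      'delta.minReaderVersion'   = '2',
      'delta.minWriterVersion'   = '5'
    );

    -- With column mapping enabled, a rename is a metadata-only change
    ALTER TABLE my_schema.events RENAME COLUMN payload TO payload_json;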

In summary, consider your specific use case:

  • If you only need to add or modify columns, use explicit schema updates.
  • If the entire schema needs to change, consider replacing the table atomically.
  • Always be aware of stream termination when updating the schema.

Remember that Delta Lake provides flexibility while maintaining data integrity, so choose the approach that best fits your use case.

rudyevers
New Contributor III

Hi @Kaniz_Fatma 

Thank you for your response. I am aware that you can change your Delta table with the DDL statements, but in our case we write directly to storage and not through the Unity Catalog. So when an external table is created referring to an external location, it takes the version at that specific moment. When the schema is changed afterwards, the external table is not updated because it is still referring to a previous version. That is also what the table property delta.lastUpdateVersion indicates. So in this case it looks like you have to drop and recreate the external table so that delta.lastUpdateVersion is the correct one.

Kaniz_Fatma
Community Manager
Accepted Solution

Hi @rudyevers, I appreciate your understanding of the intricacies of working with external tables and Delta Lake. Indeed, when you create an external table that references data in an external location, it captures the version of the data at that specific moment. Subsequent schema changes to the underlying data do not automatically update the external table, as it continues to reference the previous version. The delta.lastUpdateVersion property reflects this behaviour.

To ensure that the external table reflects the most up-to-date schema, you’re correct that dropping and recreating the external table is necessary. By doing so, you’ll align delta.lastUpdateVersion with the current state of the data.
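
For completeness, a minimal sketch of that drop-and-recreate step (catalog, schema, table name, and storage path are placeholders):

    -- Drop only the Unity Catalog metadata; because the table is external,
    -- the Delta files in the storage location are left untouched.
    DROP TABLE IF EXISTS my_catalog.my_schema.my_external_table;

    -- Recreate the table against the same path; the current schema and
    -- delta.lastUpdateVersion are picked up from the Delta log.
    CREATE TABLE my_catalog.my_schema.my_external_table
    USING DELTA
    LOCATION 'abfss://container@account.dfs.core.windows.net/path/to/table';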

If you have any further questions or need additional assistance, feel free to ask! 😊

rudyevers
New Contributor III

Hi @Kaniz_Fatma,

OK! My assumption was right. So that is what we have to live with for now. But a sort of refresh-table function would be nice for external Delta tables 😉

The project team was quite early in adopting Delta as a storage format but is willing to adopt the Databricks capabilities more and more. As an early bird, though, you sometimes suffer from choices made in the past. Our whole data logistics process works in a way that we are able to change it easily overnight (in a manner of speaking). But we will get there over time.

Thanks

-werners-
Esteemed Contributor III

I am in the same boat.

That is the reason I opted to use managed tables instead. OK, it means migrating tables and changing notebooks, but besides not having to struggle with external tables, you also get something in return (liquid clustering, for example).

-werners-
Esteemed Contributor III

Liquid clustering of course also exists for external tables; what I meant is all the upcoming AI features, which I doubt will be available for external tables.
