
Unity Catalog - external table lastUpdateVersion

rudyevers
New Contributor III

We are currently upgrading our Lakehouse to use the Unity Catalog benefits. We will mostly use external tables because all our Delta tables are already stored in Azure Storage. I am trying to figure out how to update the table property "delta.lastUpdateVersion". Since table schemas can change over time, you want the external tables to reflect the latest schema version. Looking at the documentation, nothing is mentioned about this table property as far as I have seen. The data in the table did change when we updated the Delta table, but it seems you have to fully recreate the external table if the schema has changed.

What is the best practice in this case? Recreate the external table? Or did I overlook something?
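
(For reference, the property in question can be inspected with standard Spark SQL; the three-level table name below is just a placeholder.)

    -- Show the Delta-related properties recorded for the table,
    -- including delta.lastUpdateVersion (table name is a placeholder)
    SHOW TBLPROPERTIES my_catalog.my_schema.my_external_table;

    -- Or inspect the full table metadata
    DESCRIBE TABLE EXTENDED my_catalog.my_schema.my_external_table;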

 


6 REPLIES

Kaniz_Fatma
Community Manager

Hi @rudyevers, When dealing with Delta Lake tables and schema updates, there are a few best practices to consider:

 

Schema Updates:

  • Delta Lake allows you to update the schema of a table. The following types of changes are supported:
    • Adding new columns (at arbitrary positions)
    • Reordering existing columns
    • Renaming existing columns
  • You can make these changes explicitly using DDL (Data Definition Language) or implicitly using DML (Data Manipulation Language).
  • If you want to add columns or modify the schema, you don’t necessarily need to recreate the entire table. Instead, you can explicitly update the schema using DDL statements.
  • For example (a concrete sketch follows this list):
    • To add columns, use: ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment] [FIRST|AFTER colA_name], ...)
    • To change column comment or ordering, use: ALTER TABLE table_name ALTER [COLUMN] col_name (COMMENT col_comment | FIRST | AFTER colA_name)
    • To replace columns, use: ALTER TABLE table_name REPLACE COLUMNS (col_name1 col_type1 [COMMENT col_comment1], ...)
    • To rename columns, use: ALTER TABLE table_name RENAME COLUMN old_col_name TO new_col_name
  • Note that adding nested columns is supported only for structs; arrays and maps are not supported.
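
A minimal, runnable sketch of the ALTER TABLE statements above (table and column names are placeholders, not from the original thread):

    -- Add a new column after an existing one
    ALTER TABLE my_schema.events
      ADD COLUMNS (country STRING COMMENT 'ISO country code' AFTER user_id);

    -- Change a column comment, then move the column to the front
    ALTER TABLE my_schema.events ALTER COLUMN country COMMENT 'ISO 3166-1 alpha-2 code';
    ALTER TABLE my_schema.events ALTER COLUMN country FIRST;

    -- Rename a column (requires column mapping, see "Column Mapping for Renaming" below)
    ALTER TABLE my_schema.events RENAME COLUMN country TO country_code;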

Stream Termination:

  • When you update a Delta table schema, streams that read from that table terminate. If you want the stream to continue, you must restart it.
  • Consider this impact when planning schema updates.

Replacing the Whole Table:

  • If you need to change the entire schema, you can replace the whole table atomically.
  • This involves creating a new table with the updated schema and then swapping it with the existing table.
  • This approach ensures consistency and avoids partial updates.
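
A hedged sketch of such an atomic replacement using CREATE OR REPLACE TABLE (all names are placeholders; adapt the schema and source to your own tables):

    -- Replace the table atomically: the new schema and data are committed in a
    -- single transaction, so readers never see a partially updated table.
    CREATE OR REPLACE TABLE my_schema.events
    USING DELTA
    AS
    SELECT
      event_id,
      user_id,
      CAST(event_time AS TIMESTAMP) AS event_time,  -- changed data type
      CAST(NULL AS STRING)          AS payload      -- newly added column
    FROM my_schema.events_source;                   -- placeholder source table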

Column Mapping for Renaming:

  • Renaming (or dropping) columns without rewriting the underlying data files requires column mapping to be enabled on the table (the delta.columnMapping.mode = 'name' table property).
  • Enabling column mapping upgrades the table protocol (reader version 2, writer version 5), so older Delta readers and writers may no longer be able to access the table.
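
A minimal sketch of enabling column mapping and then renaming a column (table and column names are placeholders):

    -- Enable column mapping by name; this also upgrades the table protocol
    ALTER TABLE my_schema.events SET TBLPROPERTIES (
      'delta.columnMapping.mode' = 'name',
      'delta.minReaderVersion'   = '2',
      'delta.minWriterVersion'   = '5'
    );

    -- With column mapping enabled, a rename is a metadata-only change
    ALTER TABLE my_schema.events RENAME COLUMN payload TO payload_json;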

In summary, consider your specific use case:

  • If you only need to add or modify columns, use explicit schema updates.
  • If the entire schema needs to change, consider replacing the table atomically.
  • Always be aware of stream termination when updating the schema.

Remember that Delta Lake provides flexibility while maintaining data integrity, so choose the approach that best fits your use case.

rudyevers
New Contributor III

Hi @Kaniz_Fatma 

Thank you for your response. I am aware that you can change your Delta table with the DDL statements, but in our case we write directly to storage and not through the Unity Catalog. So when an external table is created referring to an external location, it takes the version at that specific moment. When the schema is changed afterwards, the external table is not updated because it is still referring to a previous version. That is also what the table property delta.lastUpdateVersion indicates. So in this case it looks like you have to drop and recreate the external table so that delta.lastUpdateVersion is the correct one.

Kaniz_Fatma
Community Manager
Accepted Solution

Hi @rudyevers, I appreciate your understanding of the intricacies of working with external tables and Delta Lake. Indeed, when you create an external table that references data in an external location, it captures the version of the data at that specific moment. Subsequent schema changes to the underlying data do not automatically update the external table, as it continues to reference the previous version. The delta.lastUpdateVersion property reflects this behaviour.

To ensure that the external table reflects the most up-to-date schema, you’re correct that dropping and recreating the external table is necessary. By doing so, you’ll align delta.lastUpdateVersion with the current state of the data.
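
For completeness, a minimal sketch of that drop-and-recreate step (catalog, schema, table name, and storage path are placeholders):

    -- Drop only the Unity Catalog metadata; because the table is external,
    -- the Delta files in the storage location are left untouched.
    DROP TABLE IF EXISTS my_catalog.my_schema.my_external_table;

    -- Recreate the table against the same path; the current schema and
    -- delta.lastUpdateVersion are picked up from the Delta log.
    CREATE TABLE my_catalog.my_schema.my_external_table
    USING DELTA
    LOCATION 'abfss://container@account.dfs.core.windows.net/path/to/table';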

If you have any further questions or need additional assistance, feel free to ask! 😊

rudyevers
New Contributor III

Hi @Kaniz_Fatma,

OK! My assumption was right. So that is what we have to live with for now. But a sort of refresh-table function would be nice for external Delta tables 😉

The project team was quite early in adopting Delta as a storage format but is willing to adopt the Databricks capabilities more and more. As an early bird, though, you sometimes suffer from choices made in the past. Our whole data logistics process works in a way that we are able to change it easily overnight (in a manner of speaking). But we will get there over time.

Thanks

-werners-
Esteemed Contributor III

I am in the same boat.

That is the reason I opted to use managed tables instead. OK, it means migrating tables and changing notebooks, but besides not having to struggle with external tables, you also get something in return (liquid clustering, for example).

-werners-
Esteemed Contributor III

Liquid clustering of course also exists for external tables; what I meant is all the upcoming AI features, which I doubt will be available for external tables.
