Data Engineering
Apply change data with delete and schema evolution

noimeta
Contributor II

Hi,

Currently, I'm using Structured Streaming to insert/update/delete rows in a table. A row is deleted if the value in the 'Operation' column is 'deleted'. Everything seemed to work fine until a new column appeared.

Since I don't need the 'Operation' column in the target table, I use whenMatchedUpdate(set=..) and whenNotMatchedInsert(values=..) instead of whenMatchedUpdateAll() and whenNotMatchedInsertAll(). However, according to the documentation, schema evolution occurs only when there is an updateAll, an insertAll, or both. The 'Operation' column also can't be dropped from the source, since it's needed in the merge (delete) condition.

Is there any way to automatically add new columns and drop certain columns before merging?
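Since the explicit set=/values= dicts are what pin the column list, one pattern (a sketch only; merge_mappings and the column names are illustrative, not from this thread) is to build those dicts dynamically from the source DataFrame's columns so newly arrived columns are picked up while 'Operation' stays excluded. Note this does not trigger schema evolution by itself: the target table must already have the new column, e.g. added via ALTER TABLE beforehand.

```python
# Hypothetical helper (name and columns are illustrative): build the column
# mappings for whenMatchedUpdate(set=...) / whenNotMatchedInsert(values=...)
# from the source DataFrame's columns, excluding CDC metadata columns.
def merge_mappings(source_columns, exclude=("Operation",)):
    """Map every source column except the excluded ones to 'source.<col>'."""
    return {c: f"source.{c}" for c in source_columns if c not in exclude}

# Example: a source that just gained a new column 'a1'
mappings = merge_mappings(["id", "a", "a1", "Operation"])
# mappings == {"id": "source.id", "a": "source.a", "a1": "source.a1"}
# You would then pass `mappings` to both whenMatchedUpdate(set=mappings)
# and whenNotMatchedInsert(values=mappings).
```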


Hubert-Dudek
Esteemed Contributor III

To help in that case, I think I would need to see more details plus sample data.

You can also implement Delta Live Tables - there is a new function, apply_changes, which could be excellent in your case: https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html
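For reference, a minimal sketch of what apply_changes could look like for this case (table, key, and column names are assumptions, not from the thread; this code only runs inside a Delta Live Tables pipeline, not as a standalone script). Notably, except_column_list lets you keep 'Operation' out of the target, and apply_as_deletes handles the delete rows:

```python
import dlt
from pyspark.sql.functions import expr

dlt.create_streaming_table("silver")  # target table - name is illustrative

dlt.apply_changes(
    target="silver",
    source="bronze",                          # CDC feed with the 'Operation' column
    keys=["id"],                              # primary key(s) - assumption
    sequence_by="event_ts",                   # ordering column - assumption
    apply_as_deletes=expr("Operation = 'deleted'"),
    except_column_list=["Operation"],         # drop the CDC column from the target
)
```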

Thank you for your answer. I haven't tried Delta Live Tables yet, but it's in my future plans.

Anyway, the sample data looks something like:

bronze table

[Screenshot: bronze table, 2565-08-01 at 10.02.13]

silver table

[Screenshot: silver table, 2565-08-01 at 10.02.37]

Then, the schema of the bronze table was automatically updated with a new column:

[Screenshot: bronze table with new column, 2565-08-01 at 10.03.30]

This is the result I want for the silver table:

[Screenshot: desired silver table, 2565-08-01 at 10.03.51]

Currently, I have to manually update the schema of the silver table.

If I use whenMatchedUpdateAll() and whenNotMatchedInsertAll(), the Op column will be added to the silver table.

If I use whenMatchedUpdate() and whenNotMatchedInsert(), the column a1 won't be added to the table.

User16753725469
Contributor II

Please go through this documentation: https://docs.delta.io/latest/api/python/index.html

Thank you for the document. It's very helpful.

From the doc, I thought I would be able to use

deltaTable = (
    DeltaTable.replace(sparkSession)
    .tableName("testTable")
    .addColumns(df.schema)
    .execute()
)

to update the schema in the code when some schema change is detected.

Anyway, this piece of code really does replace the table, so not only does the schema get updated, but all the data is also gone.

Do you have any other suggestions?
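A non-destructive alternative (a sketch under assumptions; the helper name and column names are illustrative, not from this thread) is to compare the source and target schemas yourself and emit an ALTER TABLE ... ADD COLUMNS statement, which Delta supports and which adds columns in place without rewriting or losing existing data. The schema-diff logic is plain Python; on Databricks you would run the resulting DDL with spark.sql before the merge.

```python
# Sketch: detect columns present in the source schema but missing from the
# target, excluding CDC metadata (e.g. 'Op'), and build the DDL to add them.
def add_columns_ddl(table, source_fields, target_fields, exclude=("Op",)):
    """source_fields / target_fields: lists of (name, sql_type) pairs."""
    existing = {name for name, _ in target_fields}
    new = [(n, t) for n, t in source_fields
           if n not in existing and n not in exclude]
    if not new:
        return None  # schemas already match; nothing to run
    cols = ", ".join(f"{n} {t}" for n, t in new)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"

ddl = add_columns_ddl(
    "silver",
    [("id", "INT"), ("a", "STRING"), ("a1", "STRING"), ("Op", "STRING")],
    [("id", "INT"), ("a", "STRING")],
)
# ddl == "ALTER TABLE silver ADD COLUMNS (a1 STRING)"
# On Databricks, you would then run spark.sql(ddl) before the merge, so the
# explicit whenMatchedUpdate/whenNotMatchedInsert mappings have somewhere
# to write the new column.
```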
