SteveOstrowski
Databricks Employee

Hi @Dhruv-22,

This is a valid use case and your workaround is solid. Let me share some context on the current behavior and a few approaches that may help streamline things.

CURRENT BEHAVIOR WITH SCHEMA EVOLUTION

When automatic schema evolution is enabled and you use MERGE ... WHEN MATCHED THEN UPDATE SET *, columns that exist in the target but not in the source are left unchanged. By contrast, WHEN NOT MATCHED THEN INSERT * does set target-only columns to NULL for new rows. This asymmetry is by design: the assumption is that an UPDATE should only modify the columns provided in the source, while an INSERT needs to fill all columns in a new row.
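To make that asymmetry concrete, here is a small pure-Python sketch (no Spark needed) that simulates the two behaviors on dictionary rows. The column names and values are made up for illustration; `extra` stands in for any target-only column.

```python
# Simulate MERGE semantics for a target row with a target-only column "extra".
target_row = {"id": 1, "name": "old", "extra": "keep-me"}
source_row = {"id": 1, "name": "new"}  # source has no "extra" column

# WHEN MATCHED THEN UPDATE SET *: only source-provided columns are written,
# so the target-only column keeps its existing value.
updated = {**target_row, **source_row}
assert updated == {"id": 1, "name": "new", "extra": "keep-me"}

# WHEN NOT MATCHED THEN INSERT *: a brand-new row needs every target column,
# so target-only columns are filled with NULL (None here).
inserted = {col: source_row.get(col) for col in target_row}
assert inserted == {"id": 1, "name": "new", "extra": None}
```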

The documentation covers this here:
https://docs.databricks.com/en/delta/update-schema.html#merge-schema-evolution

WORKAROUND OPTIONS

Your dynamic MERGE approach is the right pattern for now. Here is a slightly more compact version that also handles the NOT MATCHED clause cleanly:

table_cols = spark.table("edw_nprd_aen.bronze.test_table").columns
view_cols = spark.table("test_view").columns

set_clause = ", ".join(
    [f"{col} = test_view.{col}" if col in view_cols else f"{col} = NULL"
     for col in table_cols]
)

merge_sql = f"""
MERGE INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED THEN INSERT *
"""

spark.sql(merge_sql)
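As a quick sanity check before running the MERGE, you can print the generated SET clause. Here is what it looks like for a small illustrative column set (these column names are hypothetical, standing in for your real table and view schemas):

```python
# Hypothetical column lists standing in for the real table/view schemas.
table_cols = ["id", "name", "extra"]
view_cols = ["id", "name"]

set_clause = ", ".join(
    f"{c} = test_view.{c}" if c in view_cols else f"{c} = NULL"
    for c in table_cols
)
print(set_clause)
# id = test_view.id, name = test_view.name, extra = NULL
```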

Note that you do not need MERGE WITH SCHEMA EVOLUTION in this case, since you are explicitly listing all columns rather than relying on the wildcard expansion. Schema evolution is only needed when you want Spark to automatically add new source columns to the target schema.

If you use the Python Delta Lake API instead of SQL, you can achieve the same thing with a dictionary:

from delta.tables import DeltaTable

dt = DeltaTable.forName(spark, "edw_nprd_aen.bronze.test_table")

# Reuses table_cols and view_cols from the SQL snippet above.
update_map = {
    col: f"source.{col}" if col in view_cols else "NULL"
    for col in table_cols
}

dt.alias("target").merge(
    spark.table("test_view").alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set=update_map) \
 .whenNotMatchedInsertAll() \
 .execute()
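The update map is just a dictionary of target-column names to SQL expression strings, so it is easy to inspect before calling execute(). Using the same hypothetical column lists as above:

```python
# Hypothetical column lists standing in for the real table/view schemas.
table_cols = ["id", "name", "extra"]
view_cols = ["id", "name"]

update_map = {
    col: f"source.{col}" if col in view_cols else "NULL"
    for col in table_cols
}
# Target-only columns map to the literal SQL expression "NULL".
assert update_map == {"id": "source.id", "name": "source.name", "extra": "NULL"}
```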

FEATURE REQUEST

Your proposed syntax (UPDATE SET * NULL IF NOT PRESENT) is a reasonable idea. If you would like this considered by the product team, I recommend submitting it through the Databricks Ideas portal:
https://ideas.databricks.com

Ideas that get upvotes there are reviewed and prioritized by product engineering.

SUMMARY

- UPDATE SET * with schema evolution intentionally preserves target-only columns (leaves them unchanged).
- INSERT * with schema evolution does set target-only columns to NULL.
- For your use case, dynamically building the SET clause (as you are already doing) is the recommended approach.
- Submit the feature idea to https://ideas.databricks.com for visibility with the product team.

* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation and prior history available to me. I personally review each draft for obvious issues and to monitor system reliability, and I correct it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.