01-09-2026 07:10 AM
I want to raise a feature request as follows.
Currently, with automatic schema evolution for MERGE, when a column is not present in the source dataset it is left unchanged in the target dataset. For example:
%sql
CREATE OR REPLACE TABLE edw_nprd_aen.bronze.test_table (
id INT
, assignments ARRAY<STRING>
);
INSERT INTO edw_nprd_aen.bronze.test_table
VALUES
( 1, ARRAY('S1-1', 'S1-2') )
, (2, NULL)
, (3, ARRAY('S3-1'))
, (4, ARRAY('S4-1'));
SELECT *
FROM edw_nprd_aen.bronze.test_table;

%sql
CREATE OR REPLACE TEMPORARY VIEW test_view (id) AS VALUES (1), (3);
SELECT *
FROM test_view;

%sql
MERGE WITH SCHEMA EVOLUTION INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
SELECT *
FROM edw_nprd_aen.bronze.test_table
ORDER BY id;
What I want is that, if the row matches and a column is not present in the source, that column is set to NULL.
Currently, I'm achieving this as follows
table_cols = spark.table('edw_nprd_aen.bronze.test_table').columns
view_cols = spark.table('test_view').columns

# Build the SET list first: backslash escapes such as '\n' are not
# allowed inside f-string expressions before Python 3.12
set_clause = '\n, '.join(
    f"{col} = test_view.{col}" if col in view_cols else f"{col} = NULL"
    for col in table_cols
)

merge_string = f"""
MERGE WITH SCHEMA EVOLUTION INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET
{set_clause}
WHEN NOT MATCHED THEN INSERT *
"""
print(merge_string)
spark.sql(merge_string)
It would be great if there were a built-in option for this, for example:
UPDATE SET * NULL IF NOT PRESENT
01-10-2026 05:06 AM - edited 01-10-2026 05:07 AM
Problem
When using MERGE INTO ... WITH SCHEMA EVOLUTION, if a column exists in the target table but is not present in the source dataset, that column is left unchanged on matched rows.
Solution Thinking
This can be emulated by introspecting the table and view schemas and generating the UPDATE SET list explicitly, assigning NULL for any target-only columns.
Caution:
This should work, but it adds complexity, dynamic SQL, and more room for errors in larger pipelines.
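As a standalone sketch of that introspection step (a hypothetical helper, assuming the two column lists have already been fetched; no Spark needed to test the string-building logic):

```python
def build_update_set(table_cols, view_cols, source_alias="test_view"):
    """Build the UPDATE SET assignment list for a MERGE statement.

    Columns present in the source view are assigned from it; columns
    that exist only in the target table are explicitly set to NULL.
    """
    view_set = set(view_cols)
    return "\n, ".join(
        f"{col} = {source_alias}.{col}" if col in view_set else f"{col} = NULL"
        for col in table_cols
    )

# Example: 'assignments' is missing from the source, so it is nulled.
print(build_update_set(["id", "assignments"], ["id"]))
# id = test_view.id
# , assignments = NULL
```

Keeping the generator in one tested function at least contains the dynamic-SQL risk in a single place.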
01-10-2026 09:43 AM
Hi @ManojkMohan
Can you identify the possible errors in larger pipelines? How could I achieve the required output without the above code? I found your reply hard to follow.
The reason behind my feature request is as follows: the source is a NoSQL database, and the schema coming from it is uncertain. Any new data I get should be written into the table, with the previous data overwritten if the key is already present. Since the schema is uncertain and the required behaviour on merge keys is overwrite, I have written this pipeline.
01-13-2026 09:33 AM
The introspection-based approach (e.g. spark.table(...).columns) introduces fragility.
Alternatives
Full Replace Partition (Bronze Layer)
Column Mapping + Rename/Drop (Metadata-Only):
Enable on table: ALTER TABLE test_table SET TBLPROPERTIES (delta.columnMapping.mode = 'name');
Then MERGE ... UPDATE SET * ignores physical mismatches
https://docs.databricks.com/aws/en/delta/column-mapping
An option like MERGE ... UPDATE SET * NULL_MISSING would standardize NoSQL-to-Delta syncs, cutting custom logic by roughly 80% versus the delete+insert workaround (which carries ~2x write amplification). The docs confirm that UPDATE SET * skips missing source columns intentionally; nulling them would align with INSERT * semantics for true overwrites.
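Until such an option exists, the proposed semantics can be approximated by a small SQL generator. This is a hypothetical sketch (the function name `merge_null_missing_sql` and its parameters are illustrative, not any real API), assuming simple unqualified table names:

```python
def merge_null_missing_sql(target, source, key, table_cols, source_cols):
    """Emulate a hypothetical 'UPDATE SET * NULL_MISSING' by generating
    an explicit MERGE whose matched branch nulls target-only columns."""
    source_set = set(source_cols)
    assignments = ", ".join(
        f"{c} = {source}.{c}" if c in source_set else f"{c} = NULL"
        for c in table_cols
    )
    return (
        f"MERGE INTO {target}\n"
        f"USING {source}\n"
        f"ON {target}.{key} = {source}.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {assignments}\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )

print(merge_null_missing_sql(
    "test_table", "test_view", "id", ["id", "assignments"], ["id"]
))
```

The generated statement lists every target column explicitly, so it does not rely on schema evolution's wildcard expansion.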
01-13-2026 10:04 PM - edited 01-13-2026 10:09 PM
Hi @ManojkMohan
Thanks for pointing out the issues. I now see the pain points of building the schema dynamically: schema reads can conflict with concurrent writes, and in streaming / DLT tables a schema read before every write would add significant cost.
But the solution you provided also has its own drawbacks.
Also, what is the benefit of column mapping? It focuses on renaming and dropping columns, neither of which I'm doing here.
Keeping your points in mind, I have these solutions:
MERGE WITH SCHEMA EVOLUTION INTO test_table target
USING v_final source
ON target.id = source.id
WHEN MATCHED THEN DELETE;
MERGE WITH SCHEMA EVOLUTION INTO test_table target
USING v_final source
ON target.id = source.id
WHEN NOT MATCHED THEN INSERT *;

MERGE WITH SCHEMA EVOLUTION INTO test_table target
USING (
SELECT *, 'delete' AS mode FROM test_view
UNION
SELECT *, 'insert' AS mode FROM test_view
) AS source
ON target.id = source.id AND source.mode = 'delete'
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED THEN INSERT * EXCEPT (mode);

If you have any other approaches in mind, I would like to hear them.
As of now, I have a batch setup and can ensure no concurrent writes are happening on the table. So breaking the merge into two parts would just double the runtime.
2 weeks ago
Hi @Dhruv-22,
This is a valid use case and your workaround is solid. Let me share some context on the current behavior and a few approaches that may help streamline things.
CURRENT BEHAVIOR WITH SCHEMA EVOLUTION
When automatic schema evolution is enabled and you use MERGE ... WHEN MATCHED THEN UPDATE SET *, columns that exist in the target but not in the source are left unchanged. By contrast, WHEN NOT MATCHED THEN INSERT * does set target-only columns to NULL for new rows. This asymmetry is by design: the assumption is that an UPDATE should only modify the columns provided in the source, while an INSERT needs to fill all columns in a new row.
The documentation covers this here:
https://docs.databricks.com/en/delta/update-schema.html#merge-schema-evolution
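To make the asymmetry concrete without a cluster, here is a toy model of the two branches over plain dicts. This is an illustration of the documented semantics only, not Delta's implementation:

```python
def merge_row(target_row, source_row, table_cols):
    """Toy model of MERGE semantics for a single row.

    Matched update: only source-provided columns change; target-only
    columns keep their old values.  Insert: target-only columns get None.
    """
    if target_row is not None:  # WHEN MATCHED THEN UPDATE SET *
        updated = dict(target_row)
        updated.update(source_row)  # missing source columns untouched
        return updated
    # WHEN NOT MATCHED THEN INSERT *
    return {c: source_row.get(c) for c in table_cols}  # missing -> None

cols = ["id", "assignments"]
print(merge_row({"id": 1, "assignments": ["S1-1"]}, {"id": 1}, cols))
# {'id': 1, 'assignments': ['S1-1']}   <- UPDATE leaves it unchanged
print(merge_row(None, {"id": 5}, cols))
# {'id': 5, 'assignments': None}       <- INSERT nulls it
```

This is exactly the gap the feature request targets: making the matched branch behave like the insert branch for absent columns.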
WORKAROUND OPTIONS
Your dynamic MERGE approach is the right pattern for now. Here is a slightly more compact version that also handles the NOT MATCHED clause cleanly:
table_cols = spark.table("edw_nprd_aen.bronze.test_table").columns
view_cols = spark.table("test_view").columns
set_clause = ", ".join(
[f"{col} = test_view.{col}" if col in view_cols else f"{col} = NULL"
for col in table_cols]
)
merge_sql = f"""
MERGE INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED THEN INSERT *
"""
spark.sql(merge_sql)
Note that you do not need MERGE WITH SCHEMA EVOLUTION in this case, since you are explicitly listing all columns rather than relying on the wildcard expansion. Schema evolution is only needed when you want Spark to automatically add new source columns to the target schema.
If you use the Python Delta Lake API instead of SQL, you can achieve the same thing with a dictionary:
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "edw_nprd_aen.bronze.test_table")
update_map = {
    col: f"source.{col}" if col in view_cols else "NULL"
    for col in table_cols
}
dt.alias("target").merge(
spark.table("test_view").alias("source"),
"target.id = source.id"
).whenMatchedUpdate(set=update_map) \
.whenNotMatchedInsertAll() \
.execute()
FEATURE REQUEST
Your proposed syntax (UPDATE SET * NULL IF NOT PRESENT) is a reasonable idea. If you would like this considered by the product team, I recommend submitting it through the Databricks Ideas portal:
https://ideas.databricks.com
Ideas that get upvotes there are reviewed and prioritized by product engineering.
SUMMARY
- UPDATE SET * with schema evolution intentionally preserves target-only columns (leaves them unchanged).
- INSERT * with schema evolution does set target-only columns to NULL.
- For your use case, dynamically building the SET clause (as you are already doing) is the recommended approach.
- Submit the feature idea to https://ideas.databricks.com for visibility with the product team.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.