01-09-2026 07:10 AM
I want to raise a feature request as follows.
Currently, with automatic schema evolution for MERGE, when a column is not present in the source dataset it is left unchanged in the target dataset. For example:
%sql
CREATE OR REPLACE TABLE edw_nprd_aen.bronze.test_table (
id INT
, assignments ARRAY<STRING>
);
INSERT INTO edw_nprd_aen.bronze.test_table
VALUES
( 1, ARRAY('S1-1', 'S1-2') )
, (2, NULL)
, (3, ARRAY('S3-1'))
, (4, ARRAY('S4-1'));
SELECT *
FROM edw_nprd_aen.bronze.test_table;

%sql
CREATE OR REPLACE TEMPORARY VIEW test_view (id) AS VALUES (1), (3);
SELECT *
FROM test_view;

%sql
MERGE WITH SCHEMA EVOLUTION INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
SELECT *
FROM edw_nprd_aen.bronze.test_table
ORDER BY id;
What I want is that, if the row matches and a column is not present in the source, that column is set to NULL.
Currently, I'm achieving this as follows
table_cols = spark.table('edw_nprd_aen.bronze.test_table').columns
view_cols = spark.table('test_view').columns

# Build the SET list first: backslash escapes such as '\n' are not
# allowed inside f-string expressions before Python 3.12
set_clause = '\n, '.join(
    f"{col} = test_view.{col}" if col in view_cols else f"{col} = NULL"
    for col in table_cols
)

merge_string = f"""
MERGE WITH SCHEMA EVOLUTION INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET
{set_clause}
WHEN NOT MATCHED THEN INSERT *
"""
print(merge_string)
spark.sql(merge_string)
It would be great if there were a built-in option for this, for example:
UPDATE SET * NULL IF NOT PRESENT
01-10-2026 05:06 AM - edited 01-10-2026 05:07 AM
Problem
When using MERGE INTO ... WITH SCHEMA EVOLUTION, if a column exists in the target table but is not present in the source dataset, that column is left unchanged on matched rows.
Solution Thinking
This can be emulated by introspecting the table and view schemas and generating the UPDATE SET list explicitly, assigning NULL for any target-only columns.
Caution:
This should work, but it adds complexity, dynamic SQL, and more room for errors in larger pipelines.
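As a standalone sketch of that introspection step (a hypothetical helper, assuming the two column lists have already been fetched; no Spark needed to test the string-building logic):

```python
def build_update_set(table_cols, view_cols, source_alias="test_view"):
    """Build the UPDATE SET assignment list for a MERGE statement.

    Columns present in the source view are assigned from it; columns
    that exist only in the target table are explicitly set to NULL.
    """
    view_set = set(view_cols)
    return "\n, ".join(
        f"{col} = {source_alias}.{col}" if col in view_set else f"{col} = NULL"
        for col in table_cols
    )

# Example: 'assignments' is missing from the source, so it is nulled.
print(build_update_set(["id", "assignments"], ["id"]))
# id = test_view.id
# , assignments = NULL
```

Keeping the generator in one tested function at least contains the dynamic-SQL risk in a single place.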
01-10-2026 09:43 AM
Hi @ManojkMohan
Can you identify the possible errors in larger pipelines? How could I achieve the required output without the above code? I found your reply hard to follow.
The reason behind my feature request is as follows: the source is a NoSQL database, and the schema coming from it is uncertain. Any new data I get should be written into the table, with the previous data overwritten if the key is already present. Since the schema is uncertain and the required behaviour on merge keys is overwrite, I have written this pipeline.
01-13-2026 09:33 AM
The introspection-based approach (e.g. spark.table(...).columns) introduces fragility.
Alternatives
Full Replace Partition (Bronze Layer)
Column Mapping + Rename/Drop (Metadata-Only):
Enable on table: ALTER TABLE test_table SET TBLPROPERTIES (delta.columnMapping.mode = 'name');
Then MERGE ... UPDATE SET * ignores physical mismatches
https://docs.databricks.com/aws/en/delta/column-mapping
An option like MERGE ... UPDATE SET * NULL_MISSING would standardize NoSQL-to-Delta syncs, cutting custom logic by roughly 80% versus the delete+insert workaround (which carries ~2x write amplification). The docs confirm that UPDATE SET * skips missing source columns intentionally; nulling them would align with INSERT * semantics for true overwrites.
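Until such an option exists, the proposed semantics can be approximated by a small SQL generator. This is a hypothetical sketch (the function name `merge_null_missing_sql` and its parameters are illustrative, not any real API), assuming simple unqualified table names:

```python
def merge_null_missing_sql(target, source, key, table_cols, source_cols):
    """Emulate a hypothetical 'UPDATE SET * NULL_MISSING' by generating
    an explicit MERGE whose matched branch nulls target-only columns."""
    source_set = set(source_cols)
    assignments = ", ".join(
        f"{c} = {source}.{c}" if c in source_set else f"{c} = NULL"
        for c in table_cols
    )
    return (
        f"MERGE INTO {target}\n"
        f"USING {source}\n"
        f"ON {target}.{key} = {source}.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {assignments}\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )

print(merge_null_missing_sql(
    "test_table", "test_view", "id", ["id", "assignments"], ["id"]
))
```

The generated statement lists every target column explicitly, so it does not rely on schema evolution's wildcard expansion.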
01-13-2026 10:04 PM - edited 01-13-2026 10:09 PM
Hi @ManojkMohan
Thanks for pointing out the issues. I now see the pain points of building the schema dynamically: schema reads can conflict with concurrent writes, and in streaming / DLT tables a schema read before every write would add significant cost.
But the solution you provided also has its own drawbacks.
Also, what is the benefit of column mapping? It focuses on renaming and dropping columns, neither of which I'm doing here.
Keeping your points in mind, I have these solutions:
MERGE WITH SCHEMA EVOLUTION INTO test_table target
USING v_final source
ON target.id = source.id
WHEN MATCHED THEN DELETE;
MERGE WITH SCHEMA EVOLUTION INTO test_table target
USING v_final source
ON target.id = source.id
WHEN NOT MATCHED THEN INSERT *;

MERGE WITH SCHEMA EVOLUTION INTO test_table target
USING (
SELECT *, 'delete' AS mode FROM test_view
UNION
SELECT *, 'insert' AS mode FROM test_view
) AS source
ON target.id = source.id AND source.mode = 'delete'
WHEN MATCHED THEN DELETE
WHEN NOT MATCHED THEN INSERT * EXCEPT (mode);

If you have any other approaches in mind, I would like to hear them.
As of now, I have a batch setup and can ensure no concurrent writes are happening on the table. So breaking the merge into two parts would just double the runtime.
2 weeks ago
Hi @Dhruv-22,
This is a valid use case and your workaround is solid. Let me share some context on the current behavior and a few approaches that may help streamline things.
CURRENT BEHAVIOR WITH SCHEMA EVOLUTION
When automatic schema evolution is enabled and you use MERGE ... WHEN MATCHED THEN UPDATE SET *, columns that exist in the target but not in the source are left unchanged. By contrast, WHEN NOT MATCHED THEN INSERT * does set target-only columns to NULL for new rows. This asymmetry is by design: the assumption is that an UPDATE should only modify the columns provided in the source, while an INSERT needs to fill all columns in a new row.
The documentation covers this here:
https://docs.databricks.com/en/delta/update-schema.html#merge-schema-evolution
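To make the asymmetry concrete without a cluster, here is a toy model of the two branches over plain dicts. This is an illustration of the documented semantics only, not Delta's implementation:

```python
def merge_row(target_row, source_row, table_cols):
    """Toy model of MERGE semantics for a single row.

    Matched update: only source-provided columns change; target-only
    columns keep their old values.  Insert: target-only columns get None.
    """
    if target_row is not None:  # WHEN MATCHED THEN UPDATE SET *
        updated = dict(target_row)
        updated.update(source_row)  # missing source columns untouched
        return updated
    # WHEN NOT MATCHED THEN INSERT *
    return {c: source_row.get(c) for c in table_cols}  # missing -> None

cols = ["id", "assignments"]
print(merge_row({"id": 1, "assignments": ["S1-1"]}, {"id": 1}, cols))
# {'id': 1, 'assignments': ['S1-1']}   <- UPDATE leaves it unchanged
print(merge_row(None, {"id": 5}, cols))
# {'id': 5, 'assignments': None}       <- INSERT nulls it
```

This is exactly the gap the feature request targets: making the matched branch behave like the insert branch for absent columns.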
WORKAROUND OPTIONS
Your dynamic MERGE approach is the right pattern for now. Here is a slightly more compact version that also handles the NOT MATCHED clause cleanly:
table_cols = spark.table("edw_nprd_aen.bronze.test_table").columns
view_cols = spark.table("test_view").columns
set_clause = ", ".join(
[f"{col} = test_view.{col}" if col in view_cols else f"{col} = NULL"
for col in table_cols]
)
merge_sql = f"""
MERGE INTO edw_nprd_aen.bronze.test_table
USING test_view
ON test_table.id = test_view.id
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED THEN INSERT *
"""
spark.sql(merge_sql)
Note that you do not need MERGE WITH SCHEMA EVOLUTION in this case, since you are explicitly listing all columns rather than relying on the wildcard expansion. Schema evolution is only needed when you want Spark to automatically add new source columns to the target schema.
If you use the Python Delta Lake API instead of SQL, you can achieve the same thing with a dictionary:
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, "edw_nprd_aen.bronze.test_table")
update_map = {
    col: f"source.{col}" if col in view_cols else "NULL"
    for col in table_cols
}
dt.alias("target").merge(
spark.table("test_view").alias("source"),
"target.id = source.id"
).whenMatchedUpdate(set=update_map) \
.whenNotMatchedInsertAll() \
.execute()
FEATURE REQUEST
Your proposed syntax (UPDATE SET * NULL IF NOT PRESENT) is a reasonable idea. If you would like this considered by the product team, I recommend submitting it through the Databricks Ideas portal:
https://ideas.databricks.com
Ideas that get upvotes there are reviewed and prioritized by product engineering.
SUMMARY
- UPDATE SET * with schema evolution intentionally preserves target-only columns (leaves them unchanged).
- INSERT * with schema evolution does set target-only columns to NULL.
- For your use case, dynamically building the SET clause (as you are already doing) is the recommended approach.
- Submit the feature idea to https://ideas.databricks.com for visibility with the product team.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.