<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to handle MERGE with Schema Evolution in Delta Lake in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-handle-merge-with-schema-evolution-in-delta-lake/m-p/156413#M54420</link>
    <description>Community guide to MERGE with schema evolution in Delta Lake: the WITH SCHEMA EVOLUTION SQL clause, the Python autoMerge API, explicit casts for type changes, and streaming merges with foreachBatch.</description>
    <pubDate>Thu, 07 May 2026 19:55:03 GMT</pubDate>
    <dc:creator>DushendRaghavan</dc:creator>
    <dc:date>2026-05-07T19:55:03Z</dc:date>
    <item>
      <title>How to handle MERGE with Schema Evolution in Delta Lake</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-handle-merge-with-schema-evolution-in-delta-lake/m-p/156413#M54420</link>
      <description>&lt;P&gt;Hi everyone,&lt;BR /&gt;Schema evolution during MERGE is one of the trickiest parts of building robust Delta Lake pipelines. Databricks has native SQL syntax for it, plus Python API options for programmatic pipelines. Here's a complete guide.&lt;/P&gt;&lt;P&gt;The Cleanest Way: Native SQL Syntax (Databricks DBR 15.2+)&lt;BR /&gt;Databricks added a dedicated WITH SCHEMA EVOLUTION clause directly to the MERGE statement. No session configs needed:&lt;BR /&gt;MERGE WITH SCHEMA EVOLUTION INTO target_table AS t&lt;BR /&gt;USING source_table AS s&lt;BR /&gt;ON t.id = s.id&lt;BR /&gt;WHEN MATCHED THEN UPDATE SET *&lt;BR /&gt;WHEN NOT MATCHED THEN INSERT *&lt;BR /&gt;That's it. Any new columns in source_table are automatically added to target_table. Clean, readable, and production-safe.&lt;/P&gt;&lt;P&gt;Scenario 1: New columns added to the source&lt;BR /&gt;-- source_table now has a new column: loyalty_tier&lt;BR /&gt;MERGE WITH SCHEMA EVOLUTION INTO my_catalog.my_schema.customers AS t&lt;BR /&gt;USING staging.customers_updates AS s&lt;BR /&gt;ON t.id = s.id&lt;BR /&gt;WHEN MATCHED THEN&lt;BR /&gt;  UPDATE SET&lt;BR /&gt;    t.name = s.name,&lt;BR /&gt;    t.email = s.email,&lt;BR /&gt;    t.updated_at = s.updated_at,&lt;BR /&gt;    t.loyalty_tier = s.loyalty_tier -- new column, auto-added to the target&lt;BR /&gt;WHEN NOT MATCHED THEN&lt;BR /&gt;  INSERT *&lt;BR /&gt;After execution, loyalty_tier is permanently part of the target schema; existing rows get NULL for it.&lt;/P&gt;&lt;P&gt;Scenario 2: Python API (when you need programmatic control)&lt;BR /&gt;For Python-based pipelines, enable autoMerge at the session level:&lt;BR /&gt;from delta.tables import DeltaTable&lt;/P&gt;&lt;P&gt;spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")&lt;/P&gt;&lt;P&gt;target = DeltaTable.forName(spark, "my_catalog.my_schema.customers")&lt;/P&gt;&lt;P&gt;target.alias("t").merge(&lt;BR /&gt;    source_df.alias("s"),&lt;BR /&gt;    "t.id = s.id"&lt;BR /&gt;).whenMatchedUpdateAll() \&lt;BR /&gt; .whenNotMatchedInsertAll() \&lt;BR /&gt; .execute()&lt;/P&gt;
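&lt;P&gt;If you'd rather not flip a session-wide flag, recent runtimes (DBR 15.2+ and the matching open-source Delta releases) also expose a per-operation withSchemaEvolution() call on the merge builder. A minimal sketch, assuming your runtime supports it (source_df and target as above):&lt;BR /&gt;# scope schema evolution to this single merge, no session config&lt;BR /&gt;target.alias("t").merge(&lt;BR /&gt;    source_df.alias("s"),&lt;BR /&gt;    "t.id = s.id"&lt;BR /&gt;).withSchemaEvolution() \&lt;BR /&gt; .whenMatchedUpdateAll() \&lt;BR /&gt; .whenNotMatchedInsertAll() \&lt;BR /&gt; .execute()&lt;BR /&gt;Scoping evolution to one operation is easier to review than a session-level config that affects every merge afterwards.&lt;/P&gt;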
&lt;P&gt;Scenario 3: Column type change (e.g., INT → BIGINT)&lt;BR /&gt;WITH SCHEMA EVOLUTION adds new columns; it does NOT change the types of existing columns. Cast explicitly in the source:&lt;BR /&gt;MERGE WITH SCHEMA EVOLUTION INTO my_catalog.my_schema.customers AS t&lt;BR /&gt;USING (&lt;BR /&gt;  SELECT&lt;BR /&gt;    id,&lt;BR /&gt;    name,&lt;BR /&gt;    CAST(order_count AS BIGINT) AS order_count -- was INT in the target&lt;BR /&gt;  FROM staging.customers_updates&lt;BR /&gt;) AS s&lt;BR /&gt;ON t.id = s.id&lt;BR /&gt;WHEN MATCHED THEN UPDATE SET *&lt;BR /&gt;WHEN NOT MATCHED THEN INSERT *&lt;/P&gt;&lt;P&gt;&lt;span class="lia-unicode-emoji" title=":warning:"&gt;⚠️&lt;/span&gt; Type widening (INT → BIGINT, FLOAT → DOUBLE) is safe; narrowing an existing column's type will still fail.&lt;/P&gt;&lt;P&gt;Scenario 4: Streaming pipeline with foreachBatch&lt;BR /&gt;spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")&lt;/P&gt;&lt;P&gt;target = DeltaTable.forName(spark, "my_catalog.my_schema.customers")&lt;/P&gt;&lt;P&gt;def merge_with_schema_evolution(batch_df, batch_id):&lt;BR /&gt;    target.alias("t").merge(&lt;BR /&gt;        batch_df.alias("s"),&lt;BR /&gt;        "t.id = s.id"&lt;BR /&gt;    ).whenMatchedUpdateAll() \&lt;BR /&gt;     .whenNotMatchedInsertAll() \&lt;BR /&gt;     .execute()&lt;/P&gt;&lt;P&gt;query = (&lt;BR /&gt;    spark.readStream&lt;BR /&gt;    .format("delta")&lt;BR /&gt;    .table("staging.customers_stream")&lt;BR /&gt;    .writeStream&lt;BR /&gt;    .foreachBatch(merge_with_schema_evolution)&lt;BR /&gt;    .option("checkpointLocation", "/checkpoints/customers_merge")&lt;BR /&gt;    .trigger(availableNow=True)&lt;BR /&gt;    .start()&lt;BR /&gt;)&lt;/P&gt;
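&lt;P&gt;After a run, it's worth confirming what evolution actually changed on the target. A quick check, reusing the table name from the examples above:&lt;BR /&gt;# verify the evolved schema, e.g. that loyalty_tier is now present&lt;BR /&gt;spark.table("my_catalog.my_schema.customers").printSchema()&lt;/P&gt;</description>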
      <pubDate>Thu, 07 May 2026 19:55:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-handle-merge-with-schema-evolution-in-delta-lake/m-p/156413#M54420</guid>
      <dc:creator>DushendRaghavan</dc:creator>
      <dc:date>2026-05-07T19:55:03Z</dc:date>
    </item>
    <item>
      <title>Re: How to handle MERGE with Schema Evolution in Delta Lake</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-handle-merge-with-schema-evolution-in-delta-lake/m-p/156464#M54427</link>
      <description>&lt;P&gt;Great post. I'd also consider the following points:&lt;/P&gt;&lt;P&gt;Guardrails: schema evolution is powerful, but it can also silently add garbage columns if an upstream system sends unexpected fields. Recommendation: validate or allowlist schema changes in pre-production environments before promoting to prod (especially for critical tables).&lt;BR /&gt;Observability: log schema diffs whenever evolution occurs, e.g. by comparing batch_df.schema with the table schema; see the sketch below.&lt;/P&gt;&lt;P&gt;This increases trust and helps teams adopt it confidently.&lt;/P&gt;
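&lt;P&gt;A minimal sketch of that schema-diff idea, assuming the table names from the original post; log_schema_diff and ALLOWED_NEW_COLUMNS are made-up names for illustration:&lt;BR /&gt;# hypothetical per-table allowlist of columns evolution may add&lt;BR /&gt;ALLOWED_NEW_COLUMNS = {"loyalty_tier"}&lt;/P&gt;&lt;P&gt;def log_schema_diff(spark, batch_df, table_name):&lt;BR /&gt;    # columns already on the target vs. columns arriving in this batch&lt;BR /&gt;    target_cols = {f.name for f in spark.table(table_name).schema.fields}&lt;BR /&gt;    new_cols = {f.name: f.dataType.simpleString()&lt;BR /&gt;                for f in batch_df.schema.fields if f.name not in target_cols}&lt;BR /&gt;    if new_cols:&lt;BR /&gt;        print(f"[schema-evolution] {table_name} will gain: {new_cols}")&lt;BR /&gt;    unexpected = set(new_cols) - ALLOWED_NEW_COLUMNS&lt;BR /&gt;    if unexpected:&lt;BR /&gt;        raise ValueError(f"unexpected new columns: {unexpected}")&lt;BR /&gt;    return new_cols&lt;/P&gt;&lt;P&gt;Calling this at the top of the foreachBatch function from the original post would surface (and gate) every schema change before the merge runs.&lt;/P&gt;</description>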
      <pubDate>Fri, 08 May 2026 16:43:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-handle-merge-with-schema-evolution-in-delta-lake/m-p/156464#M54427</guid>
      <dc:creator>nayan_wylde</dc:creator>
      <dc:date>2026-05-08T16:43:34Z</dc:date>
    </item>
  </channel>
</rss>

