
Schema change and OpenSearch

6502
New Contributor III

Let me be crystal clear: schema changes and OpenSearch do not fit well together. However, the data pushed to OpenSearch are processed and always have the same schema. The problem here is that Spark is reading a CDC feed, which is subject to schema change because the source table may be altered.

I attempted to solve the issue by providing the mergeSchema and schemaTrackingLocation options. As I understand it, these settings help Spark track schema information in the checkpoint data.
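Roughly, this is how I wired the options into the streaming read (the table name and the schema tracking path below are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rough sketch of the CDC read; table name and schema tracking path are placeholders.
cdc_df = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")                              # read the table's CDC feed
    .option("mergeSchema", "true")
    .option("schemaTrackingLocation", "/checkpoints/cdc/_schema")  # placeholder path
    .table("source_table")                                         # placeholder table name
)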

But it is not working; the code keeps failing with:

com.databricks.sql.transaction.tahoe.DeltaStreamingColumnMappingSchemaIncompatibleException: Streaming read is not supported on tables with read-incompatible schema changes (e.g. rename or drop or datatype changes).
Please provide a 'schemaTrackingLocation' to enable non-additive schema evolution for Delta stream processing.
 
The above error is thrown for the schema change shown in the diff below. Please note that the source table has delta.columnMapping enabled in id mode, which makes the diff larger; however, only new fields have been added, in a purely additive way.
 
@@ -178,39 +249,64 @@
       "type": "integer",
       "nullable": true,
       "metadata": {
-        "comment": "Day extraction from `action_ts`"
+        "comment": "Day extraction from `action_ts`",
+        "delta.columnMapping.id": 27,
+        "delta.columnMapping.physicalName": "day"
       }
     },
     {
       "name": "merchant_shared_request_id",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 28,
+        "delta.columnMapping.physicalName": "merchant_shared_request_id"
+      }
     },
     {
       "name": "merchant_nsid",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 29,
+        "delta.columnMapping.physicalName": "merchant_nsid"
+      }
     },
     {
       "name": "refunded_on_behalf_of",
       "type": "string",
       "nullable": true,
-      "metadata": {}
-    },
+      "metadata": {
+        "delta.columnMapping.id": 30,
+        "delta.columnMapping.physicalName": "refunded_on_behalf_of"
+      }
+    },
     {
       "name": "payment_provider_to_merchant",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 31,
+        "delta.columnMapping.physicalName": "payment_provider_to_merchant"
+      }
     },
     {
       "name": "idempotency",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 34,
+        "delta.columnMapping.physicalName": "col-eeea8bdf-5e74-4088-8d9e-208fd9e55014"
+      }
+    },
+    {
+      "name": "payment_provider_operation_id",
+      "type": "string",
+      "nullable": true,
+      "metadata": {
+        "delta.columnMapping.id": 35,
+        "delta.columnMapping.physicalName": "col-a4b6a352-73cf-4af5-aae6-364c57d6a4cf"
+      }
     }
   ]
 }
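For completeness, the column mapping mode on the source table can be confirmed with something along these lines (the table name is a placeholder):

# Placeholder table name; the output should include delta.columnMapping.mode = id
spark.sql("SHOW TBLPROPERTIES source_table").show(truncate=False)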
 
I can handle the schema change manually, but it would be much better to handle it automatically.
Any ideas?
1 REPLY

mark_ott
Databricks Employee

You are encountering a common issue in Databricks Delta Lake streaming when working with Change Data Capture (CDC) feeds: schema evolution, especially with column mapping enabled, is not fully supported automatically in streaming reads; that includes additive-only schema changes (like adding columns) when column mapping is turned on. This limitation persists even when you have set mergeSchema and schemaTrackingLocation, as you have correctly noticed.

Here's a direct answer, then suggestions and workarounds:

Direct Answer

Delta Lake streaming currently does not, as of late 2025, fully support automatic schema evolution (even additive) for streaming reads from tables with Delta column mapping enabled. The requirement from the error ("please provide a 'schemaTrackingLocation'...") only applies to non-additive changes if your table does NOT have column mapping turned on. With column mapping, non-breaking schema changes can still cause failures, particularly on streaming reads.

Why Your Solution Isn't Working

  • mergeSchema and schemaTrackingLocation help Spark recognize and track schema changes, but only support limited scenarios (primarily for batch reads or when column mapping is NOT enabled).

  • With column mapping on, Delta Lake can tolerate schema changes in batch (using column IDs to map fields), but streaming queries are more strict because they need consistent column structure throughout the stream run.

  • Additive schema changes (like adding new columns) can break an active structured streaming query, even with those settings, if Spark detects incompatible schema evolution via mapping IDs.

  • The error you see is typical: even if only new fields are added, underlying column mapping differences between checkpoints and the new table version can cause failures due to internal mapping ID mismatches.

Workarounds and Recommendations

1. Manual Schema Management (Best Reliability)

If your data pipeline has some flexibility, you can:

  • Stop the current stream when the schema changes (that is, when the schema incompatibility exception occurs)

  • Restart the streaming job (checkpointed to a new location), so Spark picks up the new schema and mapping IDs cleanly (a rough sketch follows below)
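A minimal sketch of such a restart, assuming a foreachBatch sink function (write_to_opensearch here stands in for your actual OpenSearch writer) and placeholder paths:

def restart_stream(source_path, new_checkpoint):
    # Sketch only: paths and the write_to_opensearch sink are placeholders.
    return (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .load(source_path)
        .writeStream
        .option("checkpointLocation", new_checkpoint)  # fresh checkpoint location
        .foreachBatch(write_to_opensearch)             # hypothetical OpenSearch sink
        .start()
    )

# After the schema-incompatibility failure, point the job at a new checkpoint directory
query = restart_stream("/delta/source_table", "/checkpoints/opensearch_sink_v2")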

2. Remove Column Mapping (If Possible)

If strict column mapping isn't required for your table, consider disabling it. Additive-only schema changes are more easily handled by Spark's schema evolution settings in this mode. Batch reads are a lot more forgiving.

3. Downstream Schema Shim (Dynamic Column Selection)

You can read all columns as a struct (or as a binary blob), and manually flatten/select columns in your downstream logic, to ignore new columns and prevent column mapping issues.

  • Example (PySpark):

    # allowed_columns is a predefined list of the columns the downstream sink expects
    raw_df = spark.readStream.format("delta").load(".../table")
    column_names = [col for col in raw_df.columns if col in allowed_columns]
    processed_df = raw_df.select(*column_names)
    # Now proceed as normal

    This approach requires code to handle new columns intentionally, so your stream isn't broken by surprise additions.

4. Auto-Restart with Monitoring

Build external monitoring that detects this schema exception and restarts the streaming job automatically with the updated schema (pointing to a new checkpoint). This is the closest to "automatic" that can be achieved, since true schema-evolving streaming isn't yet supported with column mapping.
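A rough sketch of such a monitoring loop, assuming a start_stream(checkpoint_path) helper similar to the restart function above, and treating the exception-message matching as an approximation:

import time
from pyspark.errors import StreamingQueryException  # pyspark.sql.utils in older PySpark versions

# Sketch only: start_stream is a hypothetical helper that builds the streaming
# query against a given checkpoint directory; the retry policy is illustrative.
attempt = 0
while True:
    query = start_stream(f"/checkpoints/opensearch_sink_v{attempt}")
    try:
        query.awaitTermination()   # blocks until the stream stops or fails
        break
    except StreamingQueryException as err:
        if "SchemaIncompatible" in str(err) or "schemaTrackingLocation" in str(err):
            attempt += 1           # restart against a fresh checkpoint on schema breakage
            time.sleep(30)
        else:
            raise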

Key Takeaway

As of November 2025, full auto schema evolution for streaming on Delta tables with column mapping is not supported; the approach you tried only works for non-mapped tables, and only for strictly additive changes. The only robust "automatic" path involves detecting schema exceptions and restarting the stream, shifting your architecture to batch writes, or disabling column mapping if you do not strongly require it.
