
Schema change and OpenSearch

6502
New Contributor III

Let me be crystal clear: schema changes and OpenSearch do not fit well together. However, the data pushed to OpenSearch are processed and always have the same schema. The problem here is that Spark is reading a CDC feed, which is subject to schema change because the source table may be altered.

I attempted to solve the issue by providing the mergeSchema and schemaTrackingLocation options. As I understand it, these settings help Spark track schema information in the checkpoint data.
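Roughly, this is how I wired the options into the streaming read (the table name and the schema tracking path below are placeholders, not the real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rough sketch of the CDC read; table name and schema tracking path are placeholders.
cdc_df = (
    spark.readStream.format("delta")
    .option("readChangeFeed", "true")                              # read the table's CDC feed
    .option("mergeSchema", "true")
    .option("schemaTrackingLocation", "/checkpoints/cdc/_schema")  # placeholder path
    .table("source_table")                                         # placeholder table name
)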

But it is not working; the code keeps failing with:

com.databricks.sql.transaction.tahoe.DeltaStreamingColumnMappingSchemaIncompatibleException: Streaming read is not supported on tables with read-incompatible schema changes (e.g. rename or drop or datatype changes).
Please provide a 'schemaTrackingLocation' to enable non-additive schema evolution for Delta stream processing.
 
The above error is thrown for the schema change shown in the diff below. Please note that the source table has delta.columnMapping enabled in id mode, which makes the diff larger; however, only new fields have been added, in a purely additive way.
 
@@ -178,39 +249,64 @@
       "type": "integer",
       "nullable": true,
       "metadata": {
-        "comment": "Day extraction from `action_ts`"
+        "comment": "Day extraction from `action_ts`",
+        "delta.columnMapping.id": 27,
+        "delta.columnMapping.physicalName": "day"
       }
     },
     {
       "name": "merchant_shared_request_id",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 28,
+        "delta.columnMapping.physicalName": "merchant_shared_request_id"
+      }
     },
     {
       "name": "merchant_nsid",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 29,
+        "delta.columnMapping.physicalName": "merchant_nsid"
+      }
     },
     {
       "name": "refunded_on_behalf_of",
       "type": "string",
       "nullable": true,
-      "metadata": {}
-    },
+      "metadata": {
+        "delta.columnMapping.id": 30,
+        "delta.columnMapping.physicalName": "refunded_on_behalf_of"
+      }
+    },
     {
       "name": "payment_provider_to_merchant",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 31,
+        "delta.columnMapping.physicalName": "payment_provider_to_merchant"
+      }
     },
     {
       "name": "idempotency",
       "type": "string",
       "nullable": true,
-      "metadata": {}
+      "metadata": {
+        "delta.columnMapping.id": 34,
+        "delta.columnMapping.physicalName": "col-eeea8bdf-5e74-4088-8d9e-208fd9e55014"
+      }
+    },
+    {
+      "name": "payment_provider_operation_id",
+      "type": "string",
+      "nullable": true,
+      "metadata": {
+        "delta.columnMapping.id": 35,
+        "delta.columnMapping.physicalName": "col-a4b6a352-73cf-4af5-aae6-364c57d6a4cf"
+      }
     }
   ]
 }
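For completeness, the column mapping mode on the source table can be confirmed with something along these lines (the table name is a placeholder):

# Placeholder table name; the output should include delta.columnMapping.mode = id
spark.sql("SHOW TBLPROPERTIES source_table").show(truncate=False)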
 
I can handle the schema change manually, but it would be much better to handle it automatically.
Any ideas?
1 REPLY

mark_ott
Databricks Employee

You are encountering a common issue in Databricks Delta Lake streaming when working with Change Data Capture (CDC) feeds: schema evolution, especially with column mapping enabled, is not fully supported automatically in streaming reads; that includes additive-only schema changes (like adding columns) when column mapping is turned on. This limitation persists even when you have set mergeSchema and schemaTrackingLocation, as you have correctly noticed.

Here's a direct answer, then suggestions and workarounds:

Direct Answer

Delta Lake streaming currently does not, as of late 2025, fully support automatic schema evolution (even additive) for streaming reads from tables with Delta column mapping enabled. The requirement from the error ("please provide a 'schemaTrackingLocation'...") only applies to non-additive changes if your table does NOT have column mapping turned on. With column mapping, non-breaking schema changes can still cause failures, particularly on streaming reads.

Why Your Solution Isn't Working

  • mergeSchema and schemaTrackingLocation help Spark recognize and track schema changes, but only support limited scenarios (primarily for batch reads or when column mapping is NOT enabled).

  • With column mapping on, Delta Lake can tolerate schema changes in batch (using column IDs to map fields), but streaming queries are more strict because they need consistent column structure throughout the stream run.

  • Additive schema changes (like adding new columns) can break an active structured streaming query, even with those settings, if Spark detects incompatible schema evolution via mapping IDs.

  • The error you see is typical: even if only new fields are added, underlying column mapping differences between checkpoints and the new table version can cause failures due to internal mapping ID mismatches.

Workarounds and Recommendations

1. Manual Schema Management (Best Reliability)

If your data pipeline has some flexibility, you can:

  • Stop the current stream when the schema changes (that is, when the schema incompatibility exception occurs)

  • Restart the streaming job (checkpointed to a new location), so Spark picks up the new schema and mapping IDs cleanly (a rough sketch follows below)
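A minimal sketch of such a restart, assuming a foreachBatch sink function (write_to_opensearch here stands in for your actual OpenSearch writer) and placeholder paths:

def restart_stream(source_path, new_checkpoint):
    # Sketch only: paths and the write_to_opensearch sink are placeholders.
    return (
        spark.readStream.format("delta")
        .option("readChangeFeed", "true")
        .load(source_path)
        .writeStream
        .option("checkpointLocation", new_checkpoint)  # fresh checkpoint location
        .foreachBatch(write_to_opensearch)             # hypothetical OpenSearch sink
        .start()
    )

# After the schema-incompatibility failure, point the job at a new checkpoint directory
query = restart_stream("/delta/source_table", "/checkpoints/opensearch_sink_v2")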

2. Remove Column Mapping (If Possible)

If strict column mapping isn't required for your table, consider disabling it. Additive-only schema changes are more easily handled by Spark's schema evolution settings in this mode. Batch reads are a lot more forgiving.

3. Downstream Schema Shim (Dynamic Column Selection)

You can read all columns as a struct (or as a binary blob), and manually flatten/select columns in your downstream logic, to ignore new columns and prevent column mapping issues.

  • Example (PySpark):

    # allowed_columns is a predefined list of the columns the downstream sink expects
    raw_df = spark.readStream.format("delta").load(".../table")
    column_names = [col for col in raw_df.columns if col in allowed_columns]
    processed_df = raw_df.select(*column_names)
    # Now proceed as normal

    This approach requires code to handle new columns intentionally, so your stream isn't broken by surprise additions.

4. Auto-Restart with Monitoring

Build external monitoring that detects this schema exception and restarts the streaming job automatically with the updated schema (pointing to a new checkpoint). This is the closest to "automatic" that can be achieved, since true schema-evolving streaming isn't yet supported with column mapping.
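A rough sketch of such a monitoring loop, assuming a start_stream(checkpoint_path) helper similar to the restart function above, and treating the exception-message matching as an approximation:

import time
from pyspark.errors import StreamingQueryException  # pyspark.sql.utils in older PySpark versions

# Sketch only: start_stream is a hypothetical helper that builds the streaming
# query against a given checkpoint directory; the retry policy is illustrative.
attempt = 0
while True:
    query = start_stream(f"/checkpoints/opensearch_sink_v{attempt}")
    try:
        query.awaitTermination()   # blocks until the stream stops or fails
        break
    except StreamingQueryException as err:
        if "SchemaIncompatible" in str(err) or "schemaTrackingLocation" in str(err):
            attempt += 1           # restart against a fresh checkpoint on schema breakage
            time.sleep(30)
        else:
            raise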

Key Takeaway

As of November 2025, full auto schema evolution for streaming on Delta tables with column mapping is not supported; the approach you tried only works for non-mapped tables, and only for strictly additive changes. The only robust "automatic" path involves detecting schema exceptions and restarting the stream, shifting your architecture to batch writes, or disabling column mapping if you do not strongly require it.
