Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Schema Evolution with "schemaTrackingLocation" fails anyway

MingOnCloud
New Contributor II

Hi, I'm trying to understand the usage of "schemaTrackingLocation" with schema evolution.

I use these articles as references:

https://docs.delta.io/latest/delta-streaming.html#tracking-non-additive-schema-changes

https://docs.databricks.com/aws/en/error-messages/error-classes#delta_streaming_schema_location_not_...

What I want to do is "simple":

I create a readStream on the source Delta table,

I writeStream to the target table with the same schema, without any transformation on the data.

What I want to achieve: with the option("schemaTrackingLocation", <path>), I want to drop a column in the source table and have the streaming process continue without error.

I'm working in the Academy Lab with all-purpose compute, Runtime 15.4, in a Python notebook. This is my code:

spark.sql(f"ALTER TABLE people_source SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")

spark.sql(f"ALTER TABLE people_target SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")
spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "always")
people_raw_stream = (spark.readStream
 .option("schemaTrackingLocation", "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target")  
 .table("dbacademy.labuser10256553_1747028817.people_source")
 .writeStream
 .option("checkpointLocation", "/Volumes/dbacademy/labuser10256553_1747028817/people_target")
 .trigger(processingTime="10 second")
 .toTable("dbacademy.labuser10256553_1747028817.people_target"))

Question 1: the "schemaTrackingLocation" path must start with "dbfs:"; if I omit it, the stream complains that the path is not under the checkpointLocation. I want to know if this is normal or if I missed some configuration.

Question 2: the Spark conf is necessary; without it the stream complains about the schema modification and offers to set this configuration (there are two other values to choose from).

Question 3: with all this configuration in place, I start the streaming process, and once it is stable I run SQL to drop a column from the source table "people_source". The process fails with this error:

com.databricks.sql.transaction.tahoe.DeltaRuntimeException: [DELTA_STREAMING_METADATA_EVOLUTION] The schema, table configuration or protocol of your Delta table has changed during streaming.

I need to restart the stream, and I can see the schema difference: the target table has one more column than the source.
When I insert new rows into the source table, they are inserted into the target table too, with NULL for the missing column.

But I wonder what I should do to avoid the stream failure on this drop-column operation. Am I missing something?

 

1 REPLY

mark_ott
Databricks Employee

Here are answers to your questions about using schemaTrackingLocation for dropping columns in Delta Lake streaming, based on your references and on operational experience.


Question 1: schemaTrackingLocation Path Requirements

Yes, it is normal and required that the schemaTrackingLocation path starts with dbfs: and sits inside (or is equal to) the checkpointLocation. If you provide a path that is not under your checkpoint, you will get an error like:

"The provided schema location is not under the checkpoint directory."

This is a built-in consistency check: the schema tracking folder must be contained within the checkpoint location for each streaming writer. This ensures that when Spark recovers its state and schema after a restart or failover, all of that information lives in a single logical, managed location.
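For illustration, one way to satisfy that check is to point schemaTrackingLocation at a subfolder of the checkpoint path. This is a minimal sketch reusing the paths and table names from your snippet; the _schema_log subfolder name and the dbfs: prefix on both options are illustrative choices, not requirements beyond what is described above.

# Keep the schema tracking location nested under the checkpoint location.
checkpoint_path = "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target"
schema_track_path = f"{checkpoint_path}/_schema_log"  # lives inside the checkpoint directory

stream = (spark.readStream
    .option("schemaTrackingLocation", schema_track_path)
    .table("dbacademy.labuser10256553_1747028817.people_source")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="10 seconds")
    .toTable("dbacademy.labuser10256553_1747028817.people_target"))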


Question 2: Spark Conf for Schema Evolution

Yes, the Spark config spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop is essential for supporting non-additive schema evolution (such as dropping or renaming columns) in Delta streaming sources. When you drop a column in the source and continue streaming, Delta blocks the operation unless you explicitly set this config.

There are three possible values:

  • "fail" (default): throws an error if columns are dropped/renamed.

  • "warn": logs a warning but runs (only if other settings allow).

  • "always": always allow source column drop/rename, with downstream implications.

It is mandatory to set "always" (or "warn", depending on your needs); otherwise streaming will not carry out drop/rename operations and will error out, telling you to change the configuration.
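As a minimal sketch, you can set and verify the config at the session level before dropping or renaming source columns. The "always" value is the one from your code; the default passed to the get call follows the list above, so double-check it against your runtime's documentation.

# Set session-wide before the query starts; "always" is the value used in the code above.
spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "always")

# Confirm the setting is in effect in this session before dropping/renaming source columns.
current = spark.conf.get("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "fail")
print(f"allowSourceColumnRenameAndDrop = {current}")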


Question 3: Why Does Stream Fail on Drop Column? Can It Be Non-Disruptive?

Current Delta Lake streaming cannot perform a source column drop non-disruptively, even with everything you configured, including schemaTrackingLocation and the Spark conf. As soon as you drop a column in the source, the schema change is detected and the stream fails by design, to ensure schema enforcement and downstream data consistency.

  • When restarted, the streaming job will resynchronize with the new schema and start processing.

  • However, after the restart the dropped column will simply appear as NULL in the target for new data, which is expected.

  • The stream will always fail immediately on a schema drop/change and requires a manual restart.

  • There is currently (as of Delta 3.x, Databricks Runtime 15.4) no option to perform column drops live without a stream interruption; this is a limitation designed to prevent silent data drift.

Schema tracking allows historical data written with the old schema to be read and new data to be mapped to the new schema, but the streaming process itself is "schema-brittle" and must be restarted on breaking changes.
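In practice, after such a failure you can inspect the terminated query handle to confirm it really was this schema-evolution error before restarting. A small sketch using the people_raw_stream handle from your code; the substring check on the error class is just an illustrative heuristic:

# After the failure, the query is no longer active; exception() returns the error (or None).
if not people_raw_stream.isActive:
    err = people_raw_stream.exception()
    if err is not None and "DELTA_STREAMING_METADATA_EVOLUTION" in str(err):
        # The source schema changed: restarting the same readStream/writeStream with the
        # same checkpointLocation and schemaTrackingLocation resumes with the evolved schema.
        print("Source schema evolved; restart the stream with the same options.")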


Best Practices and Workarounds

  • You must always plan for stream restarts if you expect to drop columns in sources used for streaming.

  • Automate your workflow to detect new schema changes and gracefully restart the stream (see the sketch after this list).

  • Use schema tracking to avoid data corruption and to allow for schema evolution over time, but not for uninterrupted schema change adoption.

  • Data written under older schema versions will still be read as before, with dropped columns appearing as NULL for new inserts.
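One way to automate the restart is to wrap the stream in a simple retry loop that only swallows this particular error. A hedged sketch: start_people_stream() is a hypothetical helper that rebuilds the exact readStream/writeStream from your snippet and returns its StreamingQuery, and the exception import location assumes Spark 3.4+ (Runtime 15.4 qualifies).

import time
from pyspark.errors import StreamingQueryException  # Spark 3.4+ location of the class

def run_with_auto_restart(max_restarts=5):
    # start_people_stream() is assumed to rebuild the readStream/writeStream exactly as above.
    for attempt in range(max_restarts):
        query = start_people_stream()
        try:
            query.awaitTermination()   # blocks; raises if the query terminates with an error
            return                     # the stream stopped cleanly
        except StreamingQueryException as err:
            if "DELTA_STREAMING_METADATA_EVOLUTION" not in str(err):
                raise                  # unrelated failure: surface it
            time.sleep(10)             # schema evolved: back off, then restart the stream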


If you need a workflow to minimize downtime, consider inserting a buffer layer and replaying unprocessed data after the restart, or design your pipeline to tolerate short interruptions during schema drops.
This behavior is expected and rooted in Delta Lake's stringent guarantees about schema compatibility and data contracts.