Schema Evolution with "schemaTrackingLocation" fails anyway

MingOnCloud
New Contributor II

Hi, I'm trying to understand the usage of "schemaTrackingLocation" with schema evolution.

I use these articles as references:

https://docs.delta.io/latest/delta-streaming.html#tracking-non-additive-schema-changes

https://docs.databricks.com/aws/en/error-messages/error-classes#delta_streaming_schema_location_not_...

What I want to do is "simple":

I create a readStream on the source Delta table, and

I writeStream to the target table with the same schema, with no transformation of the data.

What I want to achieve: with the "schemaTrackingLocation" option set, I want to drop a column in the source table and have the streaming process continue without error.

I work on an Academy Lab with all-purpose compute, Runtime 15.4, and a Python notebook. This is my code:

spark.sql(f"ALTER TABLE people_source SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")

spark.sql(f"ALTER TABLE people_target SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")
spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "always")
people_raw_stream = (spark.readStream
 .option("schemaTrackingLocation", "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target")  
 .table("dbacademy.labuser10256553_1747028817.people_source")
 .writeStream
 .option("checkpointLocation", "/Volumes/dbacademy/labuser10256553_1747028817/people_target")
 .trigger(processingTime="10 second")
 .toTable("dbacademy.labuser10256553_1747028817.people_target"))

Question 1: the "schemaTrackingLocation" path must start with "dbfs:"; if I omit that prefix, the stream complains that the path is not under the checkpointLocation. Is this normal, or did I miss some configuration?
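For reference, here is the layout I believe that error is asking for: the schema tracking log nested under the checkpoint directory (the "_schema_log" subfolder name is my own choice, not something the docs require):

checkpoint_path = "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target"
schema_log_path = checkpoint_path + "/_schema_log"  # any subdirectory of the checkpoint

people_raw_stream = (spark.readStream
    .option("schemaTrackingLocation", schema_log_path)
    .table("dbacademy.labuser10256553_1747028817.people_source")
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="10 seconds")
    .toTable("dbacademy.labuser10256553_1747028817.people_target"))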

Question 2: the Spark conf is necessary; without it, the stream complains about the schema modification and suggests setting this configuration (there are two other variants to choose from).
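For completeness, this is how I read the unblock options from the first reference above; I assume the drop-only and rename-only keys are the two other choices the error message refers to:

# What I actually set: allow both renames and drops, for all streams in the session
spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "always")

# Narrower variants described in the Delta streaming docs (my assumption about the
# "two other" choices): drop-only and rename-only
# spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnDrop", "always")
# spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRename", "always")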

Question 3: with all these configurations in place, I start the streaming process and, once it is stable, I run SQL to drop a column from the source table "people_source". The process fails with this error:

com.databricks.sql.transaction.tahoe.DeltaRuntimeException: [DELTA_STREAMING_METADATA_EVOLUTION] The schema, table configuration or protocol of your Delta table has changed during streaming.
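For reference, the schema change is just a plain column drop ("middle_name" below is a placeholder for whichever column I removed):

# Hypothetical repro step, run while the stream is active; requires the
# column mapping property set above
spark.sql("ALTER TABLE dbacademy.labuser10256553_1747028817.people_source DROP COLUMN middle_name")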

I need to restart the stream, and then I can see the schema difference: the target table has one more column than the source (the dropped one). When I insert new rows into the source table, they are inserted into the target table too, with NULL for the missing column.

But I wonder what I should do to avoid the stream failing on this drop-column operation. Am I missing something?
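The only workaround I can think of is wrapping the query in a restart loop, since each restart picks up the evolved schema from the tracking log. A rough sketch, assuming the failure surfaces through awaitTermination() as a StreamingQueryException:

from pyspark.sql.utils import StreamingQueryException

while True:
    query = (spark.readStream
        .option("schemaTrackingLocation", "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target")
        .table("dbacademy.labuser10256553_1747028817.people_source")
        .writeStream
        .option("checkpointLocation", "/Volumes/dbacademy/labuser10256553_1747028817/people_target")
        .trigger(processingTime="10 seconds")
        .toTable("dbacademy.labuser10256553_1747028817.people_target"))
    try:
        query.awaitTermination()
        break  # stream stopped cleanly
    except StreamingQueryException:
        # Expected after a non-additive schema change: the stream must restart
        # to pick up the new schema from the tracking log.
        # Caveat: this also retries on genuine failures, so it is only a sketch.
        continue

In a real job I would rather rely on the job's retry policy than a bare while True, but is the restart itself simply expected behavior here?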
