Hi, I'm trying to understand the usage of "schemaTrackingLocation" with schema evolution.
I use these articles as references:
https://docs.delta.io/latest/delta-streaming.html#tracking-non-additive-schema-changes
https://docs.databricks.com/aws/en/error-messages/error-classes#delta_streaming_schema_location_not_...
What I want to do is "simple":
I create a readStream on a source Delta table,
I writeStream to a target table with the same schema, with no transformation of the data.
What I want to achieve: with the option("schemaTrackingLocation", <path>) set, I want to drop a column in the source table and have the streaming process continue without error.
I work in an Academy Lab with all-purpose compute, Runtime 15.4, and a Python notebook. This is my code:
spark.sql(f"ALTER TABLE people_source SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")
spark.sql(f"ALTER TABLE people_target SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')")
spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "always")
people_raw_stream = (spark.readStream
.option("schemaTrackingLocation", "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target")
.table("dbacademy.labuser10256553_1747028817.people_source")
.writeStream
.option("checkpointLocation", "/Volumes/dbacademy/labuser10256553_1747028817/people_target")
.trigger(processingTime="10 second")
.toTable("dbacademy.labuser10256553_1747028817.people_target"))
Question 1: "schemaTrackingLocation" must start with "dbfs:"; if I omit it, the stream complains that the path is not under the checkpointLocation. Is this normal, or did I miss some configuration?
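To illustrate, here is a minimal sketch of the layout I would expect to satisfy that check, deriving both paths from one fully qualified base so the schema log sits under the checkpoint location (the "_schema_log" subdirectory name is my own choice, not something from the docs):

# Hypothetical layout: one "dbfs:"-qualified base path for both locations,
# with the schema tracking log in a subdirectory of the checkpoint.
base = "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target"

stream = (
    spark.readStream
        .option("schemaTrackingLocation", base + "/_schema_log")  # under the checkpoint
        .table("dbacademy.labuser10256553_1747028817.people_source")
        .writeStream
        .option("checkpointLocation", base)
        .trigger(processingTime="10 seconds")
        .toTable("dbacademy.labuser10256553_1747028817.people_target")
)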
Question 2: the Spark conf is necessary; without it the stream complains about the schema modification and offers this configuration to set, with two other variants to choose from.
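For reference, this is how I understand the choices, based on the error text and the delta.io page linked above. The per-stream variants embed a checkpoint hash that the error message prints, so those keys should be copied verbatim from the error rather than typed by hand:

# Session-wide unblock for both renames and drops (what I used):
spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRenameAndDrop", "always")

# Narrower siblings documented on the delta.io page, if only one kind of
# non-additive change should be allowed:
# spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnRename", "always")
# spark.conf.set("spark.databricks.delta.streaming.allowSourceColumnDrop", "always")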
Question 3: with all of this configuration in place, I start the streaming process and, once it is stable, I run SQL to drop a column from the source table "people_source". The process fails with this error:
com.databricks.sql.transaction.tahoe.DeltaRuntimeException: [DELTA_STREAMING_METADATA_EVOLUTION] The schema, table configuration or protocol of your Delta table has changed during streaming.
I need to restart the stream, and after that I can see the schema difference: the target table has one column more than the source.
When I insert new rows into the source table, they are inserted into the target table too, with NULL for the dropped column.
But I wonder what I should do to avoid the stream failing on this drop-column operation. Am I missing something?
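The only workaround I can come up with is to treat the failure as expected and restart the query automatically, so it picks up the new schema from the tracking log and carries on. A minimal sketch of what I mean, assuming Runtime 15.4 / Spark 3.5 (the start_stream helper and the retry loop are my own, not from the docs):

import time
from pyspark.errors import StreamingQueryException

def start_stream():
    # Hypothetical helper wrapping the same readStream/writeStream chain as above
    return (
        spark.readStream
            .option("schemaTrackingLocation", "dbfs:/Volumes/dbacademy/labuser10256553_1747028817/people_target")
            .table("dbacademy.labuser10256553_1747028817.people_source")
            .writeStream
            .option("checkpointLocation", "/Volumes/dbacademy/labuser10256553_1747028817/people_target")
            .trigger(processingTime="10 seconds")
            .toTable("dbacademy.labuser10256553_1747028817.people_target")
    )

while True:
    query = start_stream()
    try:
        query.awaitTermination()  # blocks until the query stops or fails
        break                     # clean stop: leave the loop
    except StreamingQueryException:
        # Expected after a non-additive schema change: the stream stops once
        # to record the new schema, then continues after a restart.
        time.sleep(5)

Is this restart-on-failure pattern really the intended behavior, or is there a way to have the stream survive the column drop without stopping at all?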