Adding a new column triggers reprocessing of Auto Loader source table
03-25-2024 09:19 AM
I have a source table A in Unity Catalog. This table is constantly written to and is a streaming table.
I also have another table B in Unity Catalog. This is a managed table with liquid clustering.
Using Auto Loader, I move new data from A to B with code similar to the following:
streaming_query = (
    spark.readStream.option("ignoreDeletes", "true")
    .table("catalog.bronze.A")
    .selectExpr(
        "column_1",
        "column_2"
    )
    .writeStream.option("checkpointLocation", "/Volumes/catalog/silver/checkpoints/B")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("B")
)
streaming_query.awaitTermination(timeout=300)
Everything was running smoothly, but I then decided to add a new column to the SELECT, so I ended up with a SELECT similar to:
    .selectExpr(
        "column_1",
        "column_2",
        "column_3"
    )
After this change, all the existing data in B was duplicated. The new column was populated correctly in the duplicated records. It is as if the whole of table A was reprocessed and appended to the data already in B.
Is this the expected behaviour given my configuration? What can I do to avoid this in the future, so that the existing rows simply have NULL in the added column?
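For illustration, one possible safeguard is to make the write idempotent, so that a full reprocess of A cannot append rows that B already holds. The following is only a sketch under assumptions, not a confirmed fix for the behaviour above: it assumes column_1 uniquely identifies a row (hypothetical, since my real key may differ), the helper names build_merge_condition and upsert_batch are made up for this example, and it swaps the plain append for a Delta MERGE inside foreachBatch. The new column would still need to be handled separately, for example by adding column_3 to B beforehand so that the existing rows get NULL.

```python
# Hypothetical key: assumes column_1 uniquely identifies a row in A and B.
KEY_COLUMNS = ["column_1"]


def build_merge_condition(key_columns):
    """Build the MERGE ON clause matching target (t) and source (s) rows.

    Uses the null-safe equality operator <=> so NULL keys compare as equal.
    """
    return " AND ".join(f"t.{col} <=> s.{col}" for col in key_columns)


def upsert_batch(batch_df, batch_id):
    """Insert only rows whose key is not already present in B."""
    # Imported lazily so the helper above can be used without Delta installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forName(batch_df.sparkSession, "B")
    (
        target.alias("t")
        .merge(
            # Drop duplicate keys within the micro-batch before merging.
            batch_df.dropDuplicates(KEY_COLUMNS).alias("s"),
            build_merge_condition(KEY_COLUMNS),
        )
        .whenNotMatchedInsertAll()
        .execute()
    )


# Usage sketch (requires a Spark session, so it is left commented out):
# streaming_query = (
#     spark.readStream.option("ignoreDeletes", "true")
#     .table("catalog.bronze.A")
#     .selectExpr("column_1", "column_2", "column_3")
#     .writeStream.foreachBatch(upsert_batch)
#     .option("checkpointLocation", "/Volumes/catalog/silver/checkpoints/B")
#     .trigger(availableNow=True)
#     .start()
# )
```

With this approach, rows that are replayed from A match an existing key in B and are skipped instead of appended, at the cost of a MERGE per micro-batch rather than a plain append.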