Adding a new column triggers reprocessing of Auto Loader source table

cosminsanda · 03-25-2024

I have a source table A in Unity Catalog. This table is constantly written to and is a streaming table.
I also have another table B in Unity Catalog. This is a managed table with liquid clustering.

Using Auto Loader, I move new data from A to B with code similar to the following:

streaming_query = (
    spark.readStream.option("ignoreDeletes", "true")
    .table("catalog.bronze.A")
    .selectExpr(
        "column_1",
        "column_2"
    )
    .writeStream.option("checkpointLocation", "/Volumes/catalog/silver/checkpoints/B")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("B")
)

streaming_query.awaitTermination(timeout=300)

Everything was running smoothly until I decided to add a new column to the SELECT, which now looks similar to this:

.selectExpr(
    "column_1",
    "column_2",
    "column_3"
)

After this change, all the existing data in B was duplicated. The new column was populated correctly in the duplicated records. It is as if the whole of table A was reprocessed and appended to the data already in B.
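For anyone wanting to confirm the same symptom, a check along these lines may help. It is only a minimal sketch: `duplicate_check_sql` is a hypothetical helper, not part of any library, and it assumes that `column_1` and `column_2` together identify a row, which may not hold for a real table.

```python
def duplicate_check_sql(table: str, key_columns: list[str]) -> str:
    """Build a query listing key combinations that occur more than once.

    Assumption: the given key columns uniquely identify a row when the
    table contains no duplicates.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    keys = ", ".join(key_columns)
    return (
        f"SELECT {keys}, COUNT(*) AS copies "
        f"FROM {table} "
        f"GROUP BY {keys} "
        f"HAVING COUNT(*) > 1"
    )


# Table and column names are taken from the example above.
query = duplicate_check_sql("B", ["column_1", "column_2"])
print(query)
```

In a notebook the generated statement could be run with `spark.sql(query).show()`. If the table was appended a second time, every previously loaded key should come back with at least two copies.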

Is this the expected behaviour for my configuration? What can I do to avoid it in the future, so that the existing rows simply have NULL in the added column?
