Hello all
I have a Delta table in the bronze layer, let's call it BRZ. It contains 25B rows and many duplicates. The table currently has versions 0 and 1, nothing else yet.
I then create a silver table SLV by running a one-off deduplication batch job, which creates version 0 of SLV.
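For reference, the batch job is essentially this (a simplified sketch; the paths and the business_key dedup column are placeholders for the real ones):

BRZ_PATH = "/mnt/bronze/brz"   # placeholder path
SLV_PATH = "/mnt/silver/slv"   # placeholder path

(spark.read.format("delta").load(BRZ_PATH)
    .dropDuplicates(["business_key"])   # placeholder dedup key
    .write.format("delta")
    .mode("overwrite")
    .save(SLV_PATH))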
Now I want to switch to Spark Structured Streaming, but with trigger(once=True), as we only want to run it once a day at the moment. Streaming is a nice way to handle incremental loads automatically.
So my streaming job reads with
.option("startingVersion", 2)
and writes to SLV with a MERGE statement. I chose version 2 because versions 0 and 1 were already loaded by the batch job.
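Simplified, the streaming job looks like this (paths, the checkpoint location, and the business_key join column are placeholders; the MERGE presumably runs in foreachBatch, since a streaming write can't issue a MERGE directly):

from delta.tables import DeltaTable

BRZ_PATH = "/mnt/bronze/brz"            # placeholder path
SLV_PATH = "/mnt/silver/slv"            # placeholder path
CKPT = "/mnt/checkpoints/slv"           # placeholder checkpoint location

def upsert_to_slv(micro_batch_df, batch_id):
    # Dedup within the micro-batch so MERGE doesn't see multiple
    # source rows matching the same target row.
    deduped = micro_batch_df.dropDuplicates(["business_key"])
    (DeltaTable.forPath(spark, SLV_PATH).alias("t")
        .merge(deduped.alias("s"), "t.business_key = s.business_key")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("delta")
    .option("startingVersion", 2)   # versions 0 and 1 were covered by the batch load
    .load(BRZ_PATH)
    .writeStream
    .foreachBatch(upsert_to_slv)
    .option("checkpointLocation", CKPT)
    .trigger(once=True)             # run once per daily scheduled job
    .start())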
After updates on BRZ, I have table versions 2, 3, and 4. When I run the streaming job for SLV, I expect it to be pretty fast, since it only needs to read versions 2-4, right?
Somehow, the job takes 10+ hours (on a 12-node cluster). Looking at the metrics, numTargetRowsCopied is 25B, so the MERGE rewrites all target rows again. It also wrote 6000+ Parquet files.
So my questions:
1. Is it even possible to do an initial batch load and then switch to streaming?
2. Does startingVersion do what I expect it to do (read everything after a certain Delta table version)?
3. Why does the streaming job take so long?
Thanks!