Help me understand streaming logic with Delta Tables
11-16-2023 12:00 AM
Hello all
I have a Delta table in the bronze layer, let's call it BRZ. It contains 25B rows and many duplicates. It has a version 0 and a version 1, nothing else yet.
I then create a silver table SLV by running one deduplication batch job. This creates version 0 of SLV.
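For context, the initial batch load is roughly the following (a minimal sketch; key_id and event_ts are placeholders for our real dedup columns):

```python
# One-off batch deduplication from BRZ into SLV
# (key_id and event_ts are placeholder column names).
brz = spark.table("BRZ")

slv = brz.dropDuplicates(["key_id", "event_ts"])

(
    slv.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("SLV")  # creates version 0 of SLV
)
```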
Now I want to switch to Spark Structured Streaming, but with trigger(once=True), as we only want to run it once a day at the moment. Streaming is a nice way to handle incremental loads automatically.
So my streaming job uses
.option("startingVersion", 2)
and also writes to SLV with a merge statement. Version 2, because versions 0 and 1 were already loaded by the batch job.
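In code, the streaming job looks roughly like this (a simplified sketch; the merge key, checkpoint path and merge actions are placeholders for our real ones):

```python
# Simplified sketch of the streaming job described above
# (merge condition, checkpoint path and merge actions are placeholders).
from delta.tables import DeltaTable

def upsert_to_slv(micro_batch_df, batch_id):
    slv = DeltaTable.forName(spark, "SLV")
    (
        slv.alias("t")
        .merge(
            micro_batch_df.alias("s"),
            "t.key_id = s.key_id"  # placeholder merge condition
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream
    .format("delta")
    .option("startingVersion", 2)   # skip versions 0 and 1, already handled by the batch load
    .table("BRZ")
    .writeStream
    .foreachBatch(upsert_to_slv)
    .option("checkpointLocation", "/checkpoints/slv")  # placeholder path
    .trigger(once=True)             # run once per scheduled job
    .start()
)
```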
After updates on BRZ, I have table versions 2, 3 and 4. When running the streaming job for SLV, I expect it to be pretty fast, as it only needs to load versions 2-4, right?
Somehow, the job takes 10+ hours (on a 12-node cluster). Looking at the MERGE metrics, numTargetRowsCopied is 25B, so it copies all target rows again. It also wrote 6000+ Parquet files.
So my questions:
- Is it even possible to do an initial load and then switch to streaming?
- Does startingVersion do what I expect it to do (reading everything after a certain Delta table version)?
- Why does the streaming job take so long?
Thanks!
11-16-2023 03:26 AM
Thanks for the confirmation. Not sure I can see everything, as your text gets truncated, but it basically confirms that it should work.
Anyway: it looks like the incremental load is working. The problem here is that we receive late-arriving facts that touch all previous months. So the merge statement reads in all existing Parquet files and, with that, rewrites almost all of them.
We need to see if we can limit the input data to a few months. I don't see any other solution in this case.
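One option would be to filter each micro-batch to the last few months and put a literal month predicate into the merge condition so Delta can prune the target files. A rough sketch, assuming SLV is partitioned by a month_key column (a placeholder name, like key_id):

```python
# Hypothetical sketch: keep only the last few months of a micro-batch and
# constrain the MERGE to those partitions so Delta can skip older files
# (assumes SLV is partitioned by month_key and the merge key is key_id --
# both placeholder names).
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def upsert_recent_only(micro_batch_df, batch_id):
    # drop late-arriving facts older than 3 months
    recent = micro_batch_df.filter(
        F.col("month_key") >= F.date_format(F.add_months(F.current_date(), -3), "yyyy-MM")
    )

    # distinct months in this batch, as literals, so target files can be pruned
    months = [r["month_key"] for r in recent.select("month_key").distinct().collect()]
    if not months:
        return  # nothing recent to merge
    month_list = ", ".join(f"'{m}'" for m in months)

    slv = DeltaTable.forName(spark, "SLV")
    (
        slv.alias("t")
        .merge(
            recent.alias("s"),
            f"t.month_key IN ({month_list}) AND t.key_id = s.key_id"
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

With literal month values in the ON clause, the merge should only have to read and rewrite files for those partitions instead of almost all of them.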

