Help me understand streaming logic with Delta Tables

pgruetter
Contributor

Hello all

I have a Delta table in the bronze layer, let's call it BRZ. It contains 25B rows and many duplicates. It currently has versions 0 and 1, nothing else yet.

I then create a silver table SLV by running one deduplication batch job. This creates version 0 of SLV.

Now I want to switch to Spark Structured Streaming, but with trigger(once=True), as we only want to run it once a day at the moment. Streaming is a nice way to handle incremental loads automatically.

So my streaming job uses 
.option("startingVersion", 2)
and also writes to SLV with a merge statement. Version 2, because versions 0 and 1 were loaded with the batch job.
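
For reference, roughly what the job looks like (simplified sketch; the merge key "id" and the checkpoint path are just placeholders, not our real ones):

from delta.tables import DeltaTable

def upsert_to_slv(batch_df, batch_id):
    # Merge each micro-batch into SLV; "id" stands in for the real business key
    slv = DeltaTable.forName(spark, "slv")
    (slv.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("delta")
    .option("startingVersion", 2)        # skip versions 0 and 1, already loaded by the batch job
    .table("brz")
    .writeStream
    .foreachBatch(upsert_to_slv)
    .option("checkpointLocation", "/checkpoints/slv")   # placeholder path
    .trigger(once=True)
    .start())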

After updates on BRZ, I have table versions 2, 3 and 4. When running the streaming job for SLV, I expect it to be pretty fast, as it only needs to load versions 2-4, right?
Somehow, the job takes 10+ hours (on a 12-node cluster). Looking at the metrics, numTargetRowsCopied is 25B, so it copies all rows again. It also wrote 6,000+ Parquet files.

So my questions:
- Is it even possible to do an initial load and then switch to streaming?
- Does startingVersion do what I expect it to do (read everything starting from a given Delta table version)?
- Why does the streaming job take so long?

Thanks!

 

3 REPLIES

Kaniz
Community Manager

Hi @pgruetter, you can use Delta Lake with Spark Structured Streaming to handle incremental loads automatically. When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.

The startingVersion option in Spark Structured Streaming reads all table changes starting from the specified Delta table version (inclusive).

In your case, you set startingVersion to 2, which means that the streaming job should only read versions 2-4 of the source table BRZ.

However, the streaming job takes 10+ hours and copies all 25B rows again. This could be because the startingVersion option is not working as expected. You can try setting the starting version to 3 and see if the streaming job then reads only version 4 of the source table BRZ.
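
One way to sanity-check this (purely illustrative, using the table names from above) is to look at the source table's history and at the streaming query's last reported progress:

# Show which versions exist on BRZ and which operations created them
spark.sql("DESCRIBE HISTORY brz") \
     .select("version", "timestamp", "operation", "operationMetrics") \
     .show(truncate=False)

# After a run, the StreamingQuery handle returned by start() exposes lastProgress,
# whose numInputRows field shows how many rows the stream actually read,
# e.g. print(query.lastProgress["numInputRows"])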

It’s also possible that the streaming job is writing too many small files, which can cause performance issues. Note that the maxBytesPerTrigger option limits how much data each micro-batch reads from the source rather than the size of the files it writes; to reduce the number of small output files, you would typically compact the target table afterwards (for example with OPTIMIZE) or enable optimized writes on it.

In general, it is possible to do an initial load and then switch to streaming.
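
For illustration (the values and table names are only examples):

# Rate-limit how much the stream reads per micro-batch. This only takes effect
# when the trigger processes the backlog in multiple batches (e.g.
# Trigger.AvailableNow); with trigger once the whole backlog is a single batch.
df = (spark.readStream
      .format("delta")
      .option("startingVersion", 2)
      .option("maxBytesPerTrigger", "20g")   # illustrative value
      .table("brz"))

# Compact small files on the target table after the run
spark.sql("OPTIMIZE slv")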

Source - https://docs.delta.io/latest/delta-streaming.html

pgruetter
Contributor

Thanks for the confirmation. I'm not sure I see everything, as your text gets truncated, but it basically confirms that it should work.

Anyway: it looks like the incremental load is working. The problem here is that we receive late-arriving facts that touch all previous months. So the merge statement reads in all existing Parquet files and, as a result, rewrites almost all of them.

We need to see if we can limit the merge to the last few months, something like the sketch below. We don't see any other solution in this case.
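
Roughly what I have in mind (just a sketch; the key "id", the event_date/event_month columns and the 3-month window are placeholders, and it assumes SLV is partitioned by month):

from delta.tables import DeltaTable
from pyspark.sql import functions as F

def upsert_recent_months(batch_df, batch_id):
    # Only merge facts from the last 3 months, and repeat the same window
    # in the merge condition so Delta can skip untouched target files.
    cutoff = F.add_months(F.current_date(), -3)
    recent = batch_df.where(F.col("event_date") >= cutoff)

    slv = DeltaTable.forName(spark, "slv")
    (slv.alias("t")
        .merge(
            recent.alias("s"),
            "t.id = s.id AND t.event_month >= add_months(current_date(), -3)")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())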

Kaniz
Community Manager

Hi @pgruetter, you should be able to read the full text now.
