Hello all
I have a Delta table in the bronze layer, let's call it BRZ. It contains 25B rows and many duplicates. The table currently has versions 0 and 1, nothing else yet.
I then create a silver table SLV by running a one-off deduplication batch job, which creates version 0 of SLV.
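For reference, the batch job is essentially this (a simplified sketch; the paths and the business_key dedup column are placeholders for the real ones):

BRZ_PATH = "/mnt/bronze/brz"   # placeholder path
SLV_PATH = "/mnt/silver/slv"   # placeholder path

(spark.read.format("delta").load(BRZ_PATH)
    .dropDuplicates(["business_key"])   # placeholder dedup key
    .write.format("delta")
    .mode("overwrite")
    .save(SLV_PATH))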
Now I want to switch to Spark Structured Streaming, but with trigger(once=True), as we only want to run it once a day at the moment. Streaming is a nice way to handle incremental loads automatically.
So my streaming job reads with
.option("startingVersion", 2)
and writes to SLV with a MERGE statement. I chose version 2 because versions 0 and 1 were already loaded by the batch job.
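Simplified, the streaming job looks like this (paths, the checkpoint location, and the business_key join column are placeholders; the MERGE presumably runs in foreachBatch, since a streaming write can't issue a MERGE directly):

from delta.tables import DeltaTable

BRZ_PATH = "/mnt/bronze/brz"            # placeholder path
SLV_PATH = "/mnt/silver/slv"            # placeholder path
CKPT = "/mnt/checkpoints/slv"           # placeholder checkpoint location

def upsert_to_slv(micro_batch_df, batch_id):
    # Dedup within the micro-batch so MERGE doesn't see multiple
    # source rows matching the same target row.
    deduped = micro_batch_df.dropDuplicates(["business_key"])
    (DeltaTable.forPath(spark, SLV_PATH).alias("t")
        .merge(deduped.alias("s"), "t.business_key = s.business_key")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("delta")
    .option("startingVersion", 2)   # versions 0 and 1 were covered by the batch load
    .load(BRZ_PATH)
    .writeStream
    .foreachBatch(upsert_to_slv)
    .option("checkpointLocation", CKPT)
    .trigger(once=True)             # run once per daily scheduled job
    .start())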
After updates on BRZ, I have table versions 2, 3, and 4. When I run the streaming job for SLV, I expect it to be pretty fast, since it only needs to read versions 2-4, right?
Somehow, the job takes 10+ hours (on a 12-node cluster). Looking at the metrics, numTargetRowsCopied is 25B, so the MERGE rewrites all target rows again. It also wrote 6000+ Parquet files.
So my questions:
1. Is it even possible to do an initial batch load and then switch to streaming?
2. Does startingVersion do what I expect it to do (read everything after a certain Delta table version)?
3. Why does the streaming job take so long?
Thanks!