Data Engineering

Help me understand streaming logic with Delta Tables

pgruetter
Contributor

Hello all

I have a Delta table in the bronze layer, let's call it BRZ. It contains 25B rows and many duplicates. It has a version 0 and a version 1, nothing else yet.

I then create a silver table SLV by running one deduplication batch job. This creates version 0 of SLV.
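
For context, that batch job is essentially a deduplicate-and-write (simplified sketch; "business_key" is a placeholder for our actual deduplication column(s)):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One-off batch load: deduplicate bronze and materialize it as the silver table.
(spark.read.table("BRZ")
    .dropDuplicates(["business_key"])
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("SLV"))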

Now I want to switch to Spark Structured Streaming, but with trigger once=True, as we only want to run it once a day at the moment. Streaming is a nice way to handle incremental loads automatically.

So my streaming job uses 
.option("startingVersion", 2)
and also writes to SLV with a merge statement. I use version 2 because versions 0 and 1 were already loaded by the batch job.
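
Simplified, the streaming side looks something like this (sketch; the checkpoint path and key column are placeholders for what we actually use):

from delta.tables import DeltaTable

def upsert_to_slv(microbatch_df, batch_id):
    # Merge each micro-batch from BRZ into SLV on the business key.
    slv = DeltaTable.forName(spark, "SLV")
    (slv.alias("t")
        .merge(microbatch_df.alias("s"), "t.business_key = s.business_key")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .option("startingVersion", 2)  # skip versions 0 and 1, already handled by the batch job
    .table("BRZ")
    .writeStream
    .foreachBatch(upsert_to_slv)
    .option("checkpointLocation", "/path/to/checkpoint")
    .trigger(once=True)
    .start())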

After updates on BRZ, I have table versions 2, 3 and 4. When running the streaming job for SLV, I expect it to be pretty fast, as it only needs to load versions 2-4, right?
Somehow, the job takes 10+ hours (on a 12-node cluster). Looking at the metrics, numTargetRowsCopied is 25B, so it copies all rows again. It also wrote 6000+ Parquet files.
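
For reference, MERGE operation metrics like numTargetRowsCopied show up in the Delta table history, e.g.:

spark.sql("DESCRIBE HISTORY SLV") \
    .selectExpr("version", "operation", "operationMetrics['numTargetRowsCopied'] AS rows_copied") \
    .show(truncate=False)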

So my questions:
- Is it even possible to do an initial load and then switch to streaming?
- Does startingVersion do what I expect it to do (read everything after a certain Delta table version)?
- Why does the streaming job take so long?

Thanks!

 

1 REPLY

pgruetter
Contributor

Thanks for the confirmation. I'm not sure I can see everything as your text gets truncated, but it basically confirms that it should work.

Anyway: it looks like the incremental load is working. The problem is that we receive late-arriving facts that touch all previous months, so the merge statement reads in all existing Parquet files and rewrites almost all of them.

We need to see if we can limit the merge to the last few months; I don't see any other solution in this case.
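
If we go that route, the idea would be to filter the incoming facts and add a matching date predicate to the merge condition so Delta can skip files for older months, roughly like this (sketch; "event_month" and the cutoff are placeholders, assuming SLV is partitioned or clustered by that column):

from delta.tables import DeltaTable

def upsert_recent_months(microbatch_df, batch_id):
    cutoff = "2023-01-01"  # placeholder: only touch the last few months
    # Restrict both the source rows and the target scan to recent months,
    # so the merge no longer rewrites files for old months.
    recent = microbatch_df.filter(f"event_month >= '{cutoff}'")
    (DeltaTable.forName(spark, "SLV").alias("t")
        .merge(recent.alias("s"),
               f"t.business_key = s.business_key AND t.event_month >= '{cutoff}'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())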
