Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Live Tables (Triggered Mode), how and in what order is the data processed?

L1000
New Contributor II

Hey!

I have set up a Delta Live Tables pipeline with bronze and silver tables.
I have one bronze table, which ingests data from a storage account using Auto Loader.
Multiple files are uploaded to the storage account at once.
My silver tables read and process data from this one bronze table.

Now I want to know: how does it actually process data and in which order?

Suppose new files land in my storage account at 10 am, and my pipeline triggers at 11 am.
Auto Loader detects the new files and ingests them in micro-batches depending on maxFilesPerTrigger (right? that's how I understood it).

So if bronze ingests 'micro batch 1', does silver also immediately process 'micro batch 1'? 
Or does silver first wait until all the microbatches are ingested in bronze, before it starts processing this data?

In a streaming context it would make sense to process in micro-batches, but in triggered mode I'm not so sure.
In the Delta Live Tables UI it looks like all the data is ingested into bronze before silver starts (judging by when bronze finishes and when silver turns green), but I can't tell for certain.

Thanks in advance!

2 ACCEPTED SOLUTIONS

Accepted Solutions

Anne165Hernadez
New Contributor III

Hello!

In a Delta Live Tables pipeline, the processing order depends on whether the pipeline runs in continuous (streaming) mode or triggered (batch-style) mode. In continuous mode, the silver table processes data in micro-batches as soon as the bronze table ingests them: once ‘micro batch 1’ has been ingested by the bronze table, the silver table starts processing it. In triggered mode, the silver table waits until all micro-batches have been ingested into the bronze table before it starts processing the data. Since your pipeline triggers at 11 am, Auto Loader will ingest the new files in micro-batches based on maxFilesPerTrigger, and the silver table will process that data once bronze has finished. Does this clarify the process for you?


szymon_dybczak
Contributor III

Hi @L1000 ,

I assume you are using the DLT pipeline in triggered mode. The behavior will be as follows:

Auto Loader reads from the storage account and uses the cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger option to cap how much data it ingests in each micro-batch. If both options are specified, Databricks consumes up to the lower limit, whichever is reached first.
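To make the "whichever is reached first" rule concrete, here is a small plain-Python sketch. It is only a toy model of the batching rule, not Auto Loader's actual implementation: `NewFile` and `plan_micro_batches` are hypothetical names, and the choice to always admit at least one file per batch (so an oversized file cannot stall progress) is my assumption about how a byte cap would sensibly behave.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NewFile:
    """Hypothetical stand-in for a file discovered in the storage account."""
    path: str
    size_bytes: int


def plan_micro_batches(
    files: List[NewFile],
    max_files: int,
    max_bytes: Optional[int] = None,
) -> List[List[NewFile]]:
    """Group files into micro-batches, closing a batch when either cap is hit.

    Toy model only: a batch is closed as soon as adding the next file would
    exceed the file-count cap or the byte cap, whichever comes first.
    Assumption: every batch holds at least one file, even an oversized one.
    """
    if max_files < 1:
        raise ValueError("max_files must be at least 1")

    batches: List[List[NewFile]] = []
    current: List[NewFile] = []
    current_bytes = 0

    for f in files:
        hits_file_cap = len(current) >= max_files
        hits_byte_cap = (
            max_bytes is not None
            and len(current) > 0
            and current_bytes + f.size_bytes > max_bytes
        )
        if hits_file_cap or hits_byte_cap:
            # Close the current batch and start a new one with this file.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f.size_bytes

    if current:
        batches.append(current)
    return batches


# Five 40-byte files uploaded at once.
uploaded = [NewFile(f"file_{i}.json", 40) for i in range(5)]

# Only the file cap applies: batches of 3 and 2 files.
by_count = [len(b) for b in plan_micro_batches(uploaded, max_files=3)]

# Both caps apply; the 100-byte cap is reached first: batches of 2, 2, 1.
by_both = [len(b) for b in plan_micro_batches(uploaded, max_files=3, max_bytes=100)]
```

In this model the byte cap is the tighter limit in the second call, so it decides the batch boundaries even though the file cap would have allowed three files per batch.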

When new files are detected (such as those uploaded at 10 AM in your case), Auto Loader ingests them in micro-batches. If the pipeline is triggered at 11 AM, it will consume all new data that has arrived since the last load. In other words, the bronze layer processing will consume every micro-batch made up of the new files.

Only once the bronze layer processing has finished does DLT begin processing the silver layer.
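The ordering described above can be sketched as a toy simulation in plain Python. This is not DLT code and does not use any Databricks API: `TriggeredPipeline` is a hypothetical class, the `seen` set merely stands in for the checkpoint that remembers which files were already loaded, and treating the silver step as a single pass over the new bronze rows is a simplifying assumption.

```python
from typing import List, Set, Tuple


class TriggeredPipeline:
    """Toy model of one bronze table feeding one silver table in triggered mode."""

    def __init__(self, max_files_per_trigger: int) -> None:
        if max_files_per_trigger < 1:
            raise ValueError("max_files_per_trigger must be at least 1")
        self.max_files = max_files_per_trigger
        self.seen: Set[str] = set()   # stand-in for the ingestion checkpoint
        self.bronze: List[str] = []   # rows (here: file names) landed in bronze
        self.silver_offset = 0        # how many bronze rows silver has consumed

    def update(self, files_in_storage: List[str]) -> List[Tuple[str, int]]:
        """Run one triggered update and return an event log of (table, count)."""
        events: List[Tuple[str, int]] = []

        # Bronze: drain everything that arrived since the last update,
        # one micro-batch at a time.
        backlog = [f for f in files_in_storage if f not in self.seen]
        for start in range(0, len(backlog), self.max_files):
            batch = backlog[start:start + self.max_files]
            self.bronze.extend(batch)
            self.seen.update(batch)
            events.append(("bronze", len(batch)))

        # Silver: starts only after the bronze backlog is fully ingested.
        new_rows = self.bronze[self.silver_offset:]
        if new_rows:
            self.silver_offset = len(self.bronze)
            events.append(("silver", len(new_rows)))
        return events


pipeline = TriggeredPipeline(max_files_per_trigger=2)
storage = [f"file_{i}.json" for i in range(5)]

first_run = pipeline.update(storage)                   # the 11 AM trigger
second_run = pipeline.update(storage)                  # nothing new arrived
third_run = pipeline.update(storage + ["late.json"])   # one more file landed
```

In this model the first run logs three bronze micro-batches followed by a single silver step, which matches what the pipeline UI suggests: bronze finishes before silver starts.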

 


4 REPLIES

L1000
New Contributor II

Thanks @szymon_dybczak for your reply!
Do you maybe have a link to documentation for this? I didn't find a lot of info about this and would love to read more on how Delta Live Tables work in Triggered vs Streaming mode 🙂 

Yep, here it is: Run an update on a Delta Live Tables pipeline | Databricks on AWS

You can also read up on Structured Streaming, because DLT pipelines use the Structured Streaming mechanism under the hood.
