09-11-2024 01:25 AM - edited 09-11-2024 01:29 AM
Hey!
I have set up a delta live tables pipeline with bronze and silver tables.
I have one bronze table, which ingests data from a storage account using autoloader.
Multiple files are uploaded at once in the storage account.
My silver tables read and process data from this one bronze table.
Now I want to know: how does it actually process data and in which order?
Suppose there are new files in my storage account at 10 am, and my pipeline triggers at 11 am.
The autoloader detects the new files and ingests them in micro batches depending on maxFilesPerTrigger (right? that's how I understood it).
So if bronze ingests 'micro batch 1', does silver also immediately process 'micro batch 1'?
Or does silver first wait until all the microbatches are ingested in bronze, before it starts processing this data?
In a streaming context it would make sense to process in micro batches, but in triggered mode I'm not that sure.
In the delta live table UI it seems like it first ingests all the data in bronze before silver starts (looking at when bronze is done and when silver turns green), so I'm not sure.
Thanks in advance!
09-11-2024 01:34 AM
Hello!
In a Delta Live Tables pipeline, the processing order depends on whether you're using streaming or batch mode. In streaming mode, the silver table processes data in micro-batches as soon as the bronze table ingests them. This means that as soon as "micro batch 1" is ingested by the bronze table, the silver table starts processing it. In batch mode, the silver table waits until all micro-batches are ingested into the bronze table before starting to process the data. Given your pipeline triggers at 11 am, the autoloader will ingest new files in micro-batches based on maxFilesPerTrigger, and the silver table will process these batches accordingly. Does this clarify the process for you?
09-11-2024 04:44 AM
Hi @L1000 ,
I assume you are using the DLT pipeline in triggered mode. The behavior will be as follows:
Autoloader reads from the storage account and uses the cloudFiles.maxFilesPerTrigger or cloudFiles.maxBytesPerTrigger parameters to control how much data it ingests in a micro-batch. If both parameters are specified, Databricks consumes up to the lower limit, whichever is reached first.
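To make the "whichever is reached first" rule concrete, here is a small plain-Python sketch. It is only a toy model of the behavior described above, not Databricks code: the function name `plan_micro_batches`, the file names, and the sizes are all made up for illustration, and the real Auto Loader scheduling is more involved.

```python
from typing import List, Tuple


def plan_micro_batches(
    files: List[Tuple[str, int]], max_files: int, max_bytes: int
) -> List[List[str]]:
    """Toy model: close a micro-batch as soon as either limit is reached."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        current.append(name)
        current_bytes += size
        # Whichever limit is hit first ends the current micro-batch.
        if len(current) >= max_files or current_bytes >= max_bytes:
            batches.append(current)
            current, current_bytes = [], 0
    if current:
        batches.append(current)  # leftover files form the last micro-batch
    return batches


# Hypothetical backlog: (file name, size in bytes).
new_files = [("a.json", 40), ("b.json", 40), ("c.json", 40), ("d.json", 10), ("e.json", 10)]
batches = plan_micro_batches(new_files, max_files=3, max_bytes=70)
print(batches)
```

With these made-up numbers, the first batch is closed by the byte limit after two files and the second by the file-count limit after three.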
When new files are detected (such as those uploaded at 10 AM in your case), Autoloader ingests them in micro-batches. If the pipeline is triggered at 11 AM, it will consume all new data since the last load. In other words, bronze layer processing will consume every micro-batch made up of the new files.
Only then, once the bronze layer processing is finished, DLT will begin processing the silver layer.
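The ordering described in this reply can be sketched in plain Python as well. Again, this is just an illustration of the sequence (bronze drains every micro-batch, then silver runs), assuming triggered mode; `run_triggered_update` and the uppercase "transformation" are hypothetical stand-ins, not part of DLT.

```python
from typing import List, Tuple


def run_triggered_update(batches: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Toy model of one triggered update: bronze first, silver afterwards."""
    events: List[str] = []
    bronze_rows: List[str] = []
    # Bronze consumes every pending micro-batch before anything else runs.
    for i, batch in enumerate(batches, start=1):
        bronze_rows.extend(batch)
        events.append(f"bronze: micro-batch {i}")
    # Silver starts only once bronze has finished its backlog.
    silver_rows = [name.upper() for name in bronze_rows]  # stand-in transformation
    events.append("silver: process all new bronze rows")
    return events, silver_rows


events, silver_rows = run_triggered_update([["f1", "f2"], ["f3"]])
for event in events:
    print(event)
```

The printed event log shows both bronze micro-batches before the single silver step, which matches what the DLT UI shows when bronze turns green before silver starts.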
09-11-2024 04:55 AM
Thanks @szymon_dybczak for your reply!
Do you maybe have a link to documentation for this? I didn't find a lot of info about this and would love to read more on how Delta Live Tables work in Triggered vs Streaming mode.
09-11-2024 05:11 AM
Yep, here it is: Run an update on a Delta Live Tables pipeline | Databricks on AWS
Also, you can read about Structured Streaming, because under the hood DLT pipelines use the Structured Streaming mechanism.