Databricks Community

L1000 · ‎09-11-2024

Hey!

I have set up a Delta Live Tables pipeline with bronze and silver tables.
I have one bronze table that ingests data from a storage account using Auto Loader.
Multiple files are uploaded to the storage account at once.
My silver tables read and process data from this one bronze table.

Now I want to know: how does it actually process data and in which order?

Suppose new files land in my storage account at 10 am, and my pipeline triggers at 11 am.
Auto Loader detects the new files and ingests them in micro-batches depending on maxFilesPerTrigger (right? that's how I understood it).

So if bronze ingests 'micro batch 1', does silver also immediately process 'micro batch 1'?
Or does silver first wait until all the microbatches are ingested in bronze, before it starts processing this data?

In a streaming context it would make sense to process in micro batches, but in triggered mode I'm not that sure.
In the Delta Live Tables UI it looks like all the data is ingested into bronze before silver starts (judging by when bronze finishes and when silver turns green), but I'm not sure.

Thanks in advance!

Anne165Hernadez · ‎09-11-2024

Hello!

In a Delta Live Tables pipeline, the processing order depends on whether you’re using streaming or batch mode. In streaming mode, the silver table processes data in micro-batches as soon as the bronze table ingests them. This means that as soon as ‘micro batch 1’ is ingested by the bronze table, the silver table starts processing it. In batch mode, the silver table waits until all micro-batches are ingested into the bronze table before starting to process the data. Given your pipeline triggers at 11 am, the autoloader will ingest new files in micro-batches based on maxFilesPerTrigger, and the silver table will process these batches accordingly. Does this clarify the process for you?


szymon_dybczak · ‎09-11-2024

Hi @L1000 ,

I assume you are using the DLT pipeline in triggered mode. The behavior will be as follows:

Auto Loader reads from the storage account and uses the cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger options to control how much data it ingests in a micro-batch. If both are specified, Databricks consumes up to whichever limit is reached first.
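
To make the "whichever is reached first" rule concrete, here is a small toy model in plain Python. It is only a sketch of the idea, not Auto Loader's actual implementation: `NewFile` and `plan_micro_batches` are hypothetical names made up for this illustration, and the byte limit is modelled as a simple cap. As far as I know the real byte limit is a soft maximum, so actual batch boundaries can differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class NewFile:
    """Hypothetical stand-in for a newly discovered file."""
    path: str
    size_bytes: int


def plan_micro_batches(
    files: Iterable[NewFile],
    max_files: Optional[int] = None,
    max_bytes: Optional[int] = None,
) -> List[List[NewFile]]:
    """Group files into micro-batches, closing a batch as soon as either
    the file-count limit or the byte limit would be exceeded.

    Toy model only (assumption): a batch always takes at least one file,
    even if that single file is larger than max_bytes.
    """
    batches: List[List[NewFile]] = []
    current: List[NewFile] = []
    current_bytes = 0

    for f in files:
        hit_file_cap = max_files is not None and len(current) >= max_files
        hit_byte_cap = (
            max_bytes is not None
            and len(current) > 0
            and current_bytes + f.size_bytes > max_bytes
        )
        if hit_file_cap or hit_byte_cap:
            # One of the two limits was reached first: start a new batch.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += f.size_bytes

    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Five hypothetical 100-byte files uploaded at once.
    files = [NewFile(f"file_{i}.json", 100) for i in range(5)]
    for i, batch in enumerate(plan_micro_batches(files, max_files=3, max_bytes=250), 1):
        print(f"micro-batch {i}: {[f.path for f in batch]}")
```

In this example the byte limit closes each batch before the file limit is reached, so five files end up in three micro-batches of two, two, and one file.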

When new files are detected (such as those uploaded at 10 AM in your case), Auto Loader ingests them in micro-batches. If the pipeline is triggered at 11 AM, it will consume all new data since the last load. In other words, bronze layer processing will work through every micro-batch of new files.

Only then, once the bronze layer processing is finished, DLT will begin processing the silver layer.


L1000 · ‎09-11-2024

Thanks @szymon_dybczak for your reply!
Do you maybe have a link to documentation for this? I didn't find a lot of info about this and would love to read more on how Delta Live Tables work in Triggered vs Streaming mode 🙂