<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta Live Tables are refreshed in parallel rather than sequentially in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112046#M44087</link>
    <description>&lt;P&gt;Yes, it is. All of the code is in one notebook. However, the code of the sample-DLT-pipeline-notebook is also in one notebook, and that run is sequential.&lt;/P&gt;</description>
    <pubDate>Fri, 07 Mar 2025 20:47:02 GMT</pubDate>
    <dc:creator>BobCat62</dc:creator>
    <dc:date>2025-03-07T20:47:02Z</dc:date>
    <item>
      <title>Delta Live Tables are refreshed in parallel rather than sequentially</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112025#M44073</link>
      <description>&lt;P&gt;Hi experts,&lt;/P&gt;&lt;P&gt;I have defined my DLT pipeline as follows:&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;-- Define a streaming table to ingest data from a volume
CREATE OR REFRESH STREAMING TABLE pumpdata_bronze
TBLPROPERTIES ("myCompanyPipeline.quality" = "bronze")
AS SELECT * FROM cloud_files("abfss://xxx@xxx.dfs.core.windows.net/xxx/*/*/*/*/*.JSON", "JSON");

-- Define a streaming table that cleans and partitions the bronze data
CREATE OR REFRESH STREAMING TABLE pumpdata_silver
PARTITIONED BY (extracted_date)
COMMENT "The cleaned sales orders with valid order_number(s) and partitioned by order_datetime."
TBLPROPERTIES ("myCompanyPipeline.quality" = "silver")
AS SELECT
  DATE(EnqueuedTimeUtc) AS extracted_date,
  DATE_FORMAT(EnqueuedTimeUtc, 'HH:mm:ss') AS extracted_time,
  ROUND(Body:distance, 2) AS distance
FROM STREAM(bstdwh.pumpdata_bronze)
WHERE Body IS NOT NULL;&lt;/LI-CODE&gt;&lt;P&gt;When I start this pipeline, I expect the Bronze table to refresh first, followed by the Silver table after its completion. However, both run in parallel, causing the Silver table to miss the latest data.&lt;/P&gt;&lt;P&gt;Did I miss some settings?&lt;/P&gt;</description>
      <pubDate>Fri, 07 Mar 2025 16:03:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112025#M44073</guid>
      <dc:creator>BobCat62</dc:creator>
      <dc:date>2025-03-07T16:03:55Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables are refreshed in parallel rather than sequentially</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112027#M44075</link>
      <description>&lt;P&gt;Is all of this code in the same notebook? If so, this sounds like the expected behavior; it's a performance optimization. If you need sequential execution, put the code into two notebooks and make a pipeline.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Mar 2025 16:57:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112027#M44075</guid>
      <dc:creator>Rjdudley</dc:creator>
      <dc:date>2025-03-07T16:57:47Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables are refreshed in parallel rather than sequentially</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112046#M44087</link>
      <description>&lt;P&gt;Yes, it is. All of the code is in one notebook. However, the code of the sample-DLT-pipeline-notebook is also in one notebook, and that run is sequential.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Mar 2025 20:47:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112046#M44087</guid>
      <dc:creator>BobCat62</dc:creator>
      <dc:date>2025-03-07T20:47:02Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables are refreshed in parallel rather than sequentially</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112058#M44091</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/66116"&gt;@BobCat62&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;DLT now has two publishing modes: direct publishing mode and classic (legacy) mode. See this page for more details:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/release-notes/product/2025/january#dlt-now-supports-publishing-to-tables-in-multiple-schemas-and-catalogs" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/release-notes/product/2025/january#dlt-now-supports-publishing-to-tables-in-multiple-schemas-and-catalogs&lt;/A&gt;&lt;/P&gt;&lt;P&gt;1. In legacy mode (the target variable is defined in the pipeline configuration; it is basically the default schema of the pipeline), DLT expects you to reference live.pumpdata_bronze in the definition of pumpdata_silver, the table you want to depend on pumpdata_bronze. That dependency ensures the refresh of the dependent table starts only after the bronze refresh is done, so the silver table gets the latest records. This method is legacy now, though, and it's best practice to follow the latest advancements.&lt;/P&gt;&lt;P&gt;2. In direct publishing mode (you use the schema variable instead of the target variable in the pipeline configuration; both serve the same purpose but are mutually exclusive, so only one can be set), your pipeline automatically runs in the latest mode: the live prefix is not required, and DLT handles all the dependencies itself.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ashraf1395_0-1741408720854.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15294i377045C40E187A28/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ashraf1395_0-1741408720854.png" alt="ashraf1395_0-1741408720854.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ashraf1395_1-1741408744958.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15295i3345218F6D5ADBD5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ashraf1395_1-1741408744958.png" alt="ashraf1395_1-1741408744958.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I haven't used sequencing in direct publishing mode myself, but the link above should have some guidelines on it.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Mar 2025 04:41:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-are-refreshed-in-parallel-rather-than/m-p/112058#M44091</guid>
      <dc:creator>ashraf1395</dc:creator>
      <dc:date>2025-03-08T04:41:59Z</dc:date>
    </item>
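    <!--
    A minimal sketch of the legacy-mode fix described in the reply above (an XML comment, so feed
    readers ignore it; assumes classic publishing mode with a target schema configured): rewriting
    the silver table from the original question so its FROM clause reads the bronze table through
    the LIVE virtual schema, which is what lets DLT build the dependency graph and refresh
    pumpdata_silver only after pumpdata_bronze completes.

    ```sql
    CREATE OR REFRESH STREAMING TABLE pumpdata_silver
    PARTITIONED BY (extracted_date)
    TBLPROPERTIES ("myCompanyPipeline.quality" = "silver")
    AS SELECT
      DATE(EnqueuedTimeUtc) AS extracted_date,
      DATE_FORMAT(EnqueuedTimeUtc, 'HH:mm:ss') AS extracted_time,
      ROUND(Body:distance, 2) AS distance
    FROM STREAM(live.pumpdata_bronze)
    WHERE Body IS NOT NULL;
    ```

    The only change from the question's code is STREAM(live.pumpdata_bronze) in place of
    STREAM(bstdwh.pumpdata_bronze); in direct publishing mode the plain table name works and DLT
    infers the dependency itself.
    -->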
  </channel>
</rss>

