Databricks Community

BobCat62 · ‎03-07-2025

Hi experts,

I have defined my DLT Pipeline as follows:

-- Define a streaming table to ingest data from a volume
CREATE OR REFRESH STREAMING TABLE pumpdata_bronze
TBLPROPERTIES ("myCompanyPipeline.quality" = "bronze")
AS SELECT *
FROM cloud_files("abfss://xxx@xxx.dfs.core.windows.net/xxx/*/*/*/*/*.JSON", "JSON");

-- Define a streaming table to ingest data from a volume
CREATE OR REFRESH STREAMING TABLE pumpdata_silver
PARTITIONED BY (extracted_date)
COMMENT "The cleaned sales orders with valid order_number(s) and partitioned by order_datetime."
TBLPROPERTIES ("myCompanyPipeline.quality" = "silver")
AS SELECT
  DATE(EnqueuedTimeUtc) AS extracted_date,
  DATE_FORMAT(EnqueuedTimeUtc, 'HH:mm:ss') AS extracted_time,
  ROUND(Body:distance, 2) AS distance
FROM STREAM(bstdwh.pumpdata_bronze)
WHERE Body IS NOT NULL;

When I start this pipeline, I expect the Bronze table to refresh first, followed by the Silver table after its completion. However, both run in parallel, causing the Silver table to miss the latest data.

Did I miss some settings?

ashraf1395 · ‎03-07-2025

Hi @BobCat62,

DLT now has two publishing modes: direct publishing mode and classic (legacy) mode. See the release notes for details on the modes: https://docs.databricks.com/aws/en/release-notes/product/2025/january#dlt-now-supports-publishing-to...

1. Legacy mode: the pipeline settings define a target (basically the default schema of the pipeline). In this mode DLT expects you to use the live keyword in the table that should depend on pumpdata_bronze, so the pumpdata_silver query should read from live.pumpdata_bronze. That way the refresh of the dependent table starts only once the bronze refresh is done, so it picks up the latest records.

That said, this method is legacy now, and it's best practice to follow the latest approach.

2. Direct publishing mode: the pipeline settings use schema instead of target (both serve the same purpose but are mutually exclusive, so only one can be used). Using schema automatically means your pipeline is in the latest mode, so live is not required and DLT handles all the dependencies itself.

I haven't used sequential execution in direct publishing mode myself, but the link above should have some guidelines on it.
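To make option 1 concrete, here is a minimal sketch of the silver table from the question rewritten for legacy mode. It assumes pumpdata_bronze is defined in the same pipeline, as in the original post. The only change is the source reference: bstdwh.pumpdata_bronze becomes LIVE.pumpdata_bronze. The schema-qualified name is most likely being treated as an external table rather than a pipeline dataset, which would explain why both tables refresh in parallel.

```sql
-- Sketch for legacy publishing mode (pipeline settings use "target").
-- Assumption: pumpdata_bronze is defined in this same pipeline.
CREATE OR REFRESH STREAMING TABLE pumpdata_silver
PARTITIONED BY (extracted_date)
TBLPROPERTIES ("myCompanyPipeline.quality" = "silver")
AS SELECT
  DATE(EnqueuedTimeUtc) AS extracted_date,
  DATE_FORMAT(EnqueuedTimeUtc, 'HH:mm:ss') AS extracted_time,
  ROUND(Body:distance, 2) AS distance
-- LIVE marks pumpdata_bronze as a dataset of this pipeline,
-- so DLT can order the silver refresh after the bronze refresh.
FROM STREAM(LIVE.pumpdata_bronze)
WHERE Body IS NOT NULL;
```

In direct publishing mode (option 2) the LIVE prefix should not be needed, so STREAM(pumpdata_bronze) would be expected to resolve against the pipeline's default schema. I haven't verified this on a running pipeline, so check it against the release notes linked above.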


Rjdudley · ‎03-07-2025

Is all of this code in the same notebook? If so, this sounds like expected behavior; it's a performance optimization. If you need sequential execution, put the code into two notebooks and build a pipeline from them.

BobCat62 · ‎03-07-2025

Yes, it is. All the code is in one notebook. But the code of the sample DLT pipeline notebook is also in one notebook, and its run is sequential.
