Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delay when updating Bronze and Silver tables in the same notebook (DBR 13.1)

adriennn
Contributor II

I created a notebook that uses Autoloader to load data from storage and append it to a Bronze table in the first cell. This works fine, and Autoloader picks up new data when it arrives (the notebook is run as a Job).

In the same notebook, a few cells below, a query against the table history API of the Bronze table is used to identify new data in the Bronze table and merge it into a Silver table. The notebook uses SQL strictly (i.e. not using foreachBatch, and the history query takes one or more Autoloader batches into account). We noticed that in some cases (e.g. an update to Silver via a view where a lot of transformation happens), the timing of the updates in the underlying Delta tables does not allow updating a Silver table right after data ingestion within the same notebook. In other words, even though new data was added to Bronze, running an upsert into Silver right away shows that no data was added. But if we add a delay manually (for some tables, 1 min is not enough), the data comes into Silver fine.
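One plausible explanation (not confirmed in this thread) is that a structured streaming write returns control to the notebook cell before the micro-batch has fully committed to the Delta log, so a downstream cell can run too early. A minimal sketch of forcing the cell to block until the Bronze write completes before the Silver upsert runs; it assumes Databricks Runtime (the `cloudFiles` source is Databricks-specific), and all paths and table names are hypothetical placeholders:

```python
# Sketch only: assumes Databricks Runtime; paths/tables are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                     # assumed source format
    .load("/mnt/landing/events")                             # hypothetical landing path
    .writeStream
    .option("checkpointLocation", "/mnt/chk/bronze_events")  # hypothetical checkpoint
    .trigger(availableNow=True)   # drain all pending files, then stop
    .toTable("bronze.events")     # hypothetical Bronze table
)

# Without this, the cell can finish while the write is still in flight,
# and a MERGE a few cells later may see no new Bronze data yet.
query.awaitTermination()

# Only now run the downstream upsert against the committed batch.
spark.sql("""
  MERGE INTO silver.events AS s             -- hypothetical Silver table
  USING (SELECT * FROM bronze.events) AS b  -- new-row selection via history elided
  ON s.id = b.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```

The same blocking behavior can be had in a pure-SQL notebook by running the ingestion and the MERGE as separate, dependent Job tasks rather than separate cells.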

My question is, is it by design that the timing of the execution of a cell in a job does not reflect what is happening in the background in storage? If it is, is it documented?

We know of streaming tables and are already using DLTs in other notebooks, but I wanted to have some clarity on the above.

1 REPLY

adriennn
Contributor II

Thanks @Retired_mod. In a case where it's not possible or not practical to implement a pipeline with DLTs, what would that "retry mechanism" be based on? I.e., is there an API other than the table history that can be leveraged to retry until "it works", or do you mean to retry until I see, e.g., a non-zero number of rows inserted in the downstream tables? Thanks!
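If DLT is off the table, one way to read "retry mechanism" is to poll until the Delta version of the Bronze table advances past the version observed before ingestion, with a timeout. A minimal, platform-agnostic sketch; the `get_version` callable is an assumption you would back with a real history query (e.g. a wrapper around `DESCRIBE HISTORY <table> LIMIT 1`):

```python
import time

def wait_for_table_change(get_version, baseline_version,
                          timeout_s=300.0, poll_s=5.0):
    """Poll until the table's version advances past `baseline_version`.

    `get_version` is any zero-arg callable returning the current version,
    e.g. a hypothetical wrapper around DESCRIBE HISTORY.
    Returns the new version, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = get_version()
        if current > baseline_version:
            return current
        time.sleep(poll_s)
    raise TimeoutError(
        f"table version still {baseline_version} after {timeout_s}s"
    )
```

Capturing the baseline version before the Autoloader cell and calling this helper before the MERGE would make the delay explicit and bounded instead of a hard-coded sleep.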
