Delay when updating Bronze and Silver tables in the same notebook (DBR 13.1)

adriennn
Contributor

I created a notebook that uses Autoloader to load data from storage and append it to a Bronze table in the first cell. This works fine, and Autoloader picks up new data when it arrives (the notebook is run as a Job).

In the same notebook, a few cells below, a query against the table history API of the Bronze table is used to identify new data in the Bronze table and merge it into a Silver table. The notebook uses SQL strictly (i.e. no foreachBatch, and the history query accounts for one or more Autoloader batches). We noticed that in some cases (e.g. updating Silver via a view where a lot of transformation happens) the timing of the updates to the underlying Delta tables does not allow a Silver table to be updated right after data ingestion within the same notebook; in other words, even though new data was added to Bronze, running an upsert into Silver right away shows that no data was added. But if we add a delay manually (for some tables, 1 min is not enough), the data comes into Silver fine.

My question is, is it by design that the timing of the execution of a cell in a job does not reflect what is happening in the background in storage? If it is, is it documented?

We know of streaming tables and are already using DLTs in other notebooks, but I wanted to have some clarity on the above.

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @adriennn, the behaviour you're seeing is related to the nature of distributed computing and the eventual-consistency model. When using Delta Lake with Autoloader, the data is distributed across multiple nodes, and updates are propagated across those nodes. Due to network latencies and the time it takes to update all nodes, there can be a delay before every node reflects the latest state of the data. This is not explicitly documented, but it is a well-known characteristic of distributed systems. In your case, the delay between ingesting data into the Bronze table and that data becoming available for querying and further processing (such as merging into the Silver table) is a manifestation of this characteristic.

To handle this, you may consider the following:

1. Introduce a delay or a retry mechanism in your notebook after data ingestion and before querying the new data. This ensures all nodes have been updated with the latest data before you query it.

2. Use Delta Lake's built-in OPTIMIZE command to compact small files into larger ones, which speeds up queries. Be mindful, however, that OPTIMIZE is a resource-intensive operation and should be used judiciously.

3. By using Databricks Autoloader with Delta Live Tables (DLT), you can leverage its built-in mechanisms for handling schema evolution and for monitoring via metrics in the event log.

Please note that the timing of the execution of a cell in a job does not necessarily reflect what is happening in the background in storage.
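The retry mechanism from point 1 can be sketched as a small polling helper. This is a minimal illustration, not official Databricks API code: `get_latest_version` is a placeholder callable you would implement yourself, for example by reading the latest `version` column from `DESCRIBE HISTORY` on the Bronze table.

```python
import time


def wait_for_new_version(get_latest_version, last_seen_version,
                         timeout_s=300, poll_interval_s=10):
    """Poll a table-version callable until it advances past last_seen_version.

    get_latest_version: zero-argument callable returning the current table
    version as an int (e.g. backed by a DESCRIBE HISTORY ... LIMIT 1 query).
    Returns the newly observed version, or raises TimeoutError if no new
    version appears within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = get_latest_version()
        if current > last_seen_version:
            return current
        time.sleep(poll_interval_s)
    raise TimeoutError(
        f"no version newer than {last_seen_version} after {timeout_s}s")
```

In a notebook, you would capture the Bronze table's version before triggering ingestion, then call this helper before running the Silver merge, instead of hard-coding a fixed sleep.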

Sources:
- [Docs: etl-quick-start](https://docs.databricks.com/getting-started/etl-quick-start.html)
- [Docs: index](https://docs.databricks.com/ingestion/index.html)
- [Docs: production](https://docs.databricks.com/ingestion/auto-loader/production.html)


2 REPLIES


adriennn
Contributor

Thanks @Kaniz. In a case where it's not possible or not practical to implement the pipeline with DLT, what would that "retry mechanism" be based on? I.e., is there an API other than the table history that can be leveraged to retry until "it works", or do you mean retrying until I see, e.g., a non-zero number of rows inserted into the downstream tables? Thanks!
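One way to ground the retry in the table history API mentioned in the original post is to treat "new data" as any Bronze commit newer than the last version the Silver merge processed, and retry until such a commit appears. A minimal sketch, under stated assumptions: it assumes the history rows are available as plain dicts (e.g. collected from a `DESCRIBE HISTORY` query), and that Autoloader's streaming writes show up as `STREAMING UPDATE` operations with a `numOutputRows` entry in `operationMetrics`, which is worth verifying on your DBR version.

```python
def new_appends_since(history, last_version):
    """Return history entries newer than last_version that actually wrote rows.

    history: list of dicts shaped like rows of DESCRIBE HISTORY, with keys
    "version", "operation", and "operationMetrics". A Silver upsert only
    makes sense once this list is non-empty.
    """
    return [
        h for h in history
        if h["version"] > last_version
        and h["operation"] == "STREAMING UPDATE"
        and int(h.get("operationMetrics", {}).get("numOutputRows", 0)) > 0
    ]
```

Retrying on this condition is stricter than retrying on a non-zero downstream row count, because it distinguishes genuine appends from maintenance commits such as OPTIMIZE.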
