Delay when updating Bronze and Silver tables in the same notebook (DBR 13.1)

adriennn
Contributor

I created a notebook that uses Autoloader to load data from storage and append it to a Bronze table in the first cell. This works fine, and Autoloader picks up new data when it arrives (the notebook is run as a Job).

In the same notebook, a few cells below, a query against the table history API of the Bronze table is used to identify new data in the Bronze table and merge it into a Silver table. The notebook uses SQL strictly (i.e. no foreachBatch, and the history query accounts for one or more Autoloader batches). We noticed that in some cases (e.g. updating Silver via a view where a lot of transformation happens) the timing of the updates to the underlying Delta tables does not allow a Silver table to be updated right after data ingestion within the same notebook; in other words, even though new data was added to Bronze, running an upsert into Silver right away shows that no data was added. But if we add a delay manually (for some tables, 1 min is not enough), the data comes into Silver fine.

My question is, is it by design that the timing of the execution of a cell in a job does not reflect what is happening in the background in storage? If it is, is it documented?

We know of streaming tables and are already using DLTs in other notebooks, but I wanted to have some clarity on the above.

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @adriennn, the behaviour you're seeing is related to the nature of distributed computing and the eventual-consistency model. When using Delta Lake with Autoloader, the data is distributed across multiple nodes, and updates are propagated across those nodes. Due to network latencies and the time it takes to update all nodes, there can be a delay before every node reflects the latest state of the data. This is not explicitly documented, but it is a well-known characteristic of distributed systems. In your case, the delay between ingesting data into the Bronze table and that data becoming available for querying and further processing (such as merging into the Silver table) is a manifestation of this characteristic.

To handle this, you may consider the following:

1. Introduce a delay or a retry mechanism in your notebook after data ingestion and before querying the new data. This ensures all nodes have been updated with the latest data before you query it.

2. Use Delta Lake's built-in OPTIMIZE command to compact small files into larger ones, which speeds up queries. Be mindful, however, that OPTIMIZE is a resource-intensive operation and should be used judiciously.

3. By using Databricks Autoloader with Delta Live Tables (DLT), you can leverage its built-in mechanisms for handling schema evolution and for monitoring via metrics in the event log.

Please note that the timing of the execution of a cell in a job does not necessarily reflect what is happening in the background in storage.
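The retry mechanism from point 1 can be sketched as a small polling helper. This is a minimal illustration, not official Databricks API code: `get_latest_version` is a placeholder callable you would implement yourself, for example by reading the latest `version` column from `DESCRIBE HISTORY` on the Bronze table.

```python
import time


def wait_for_new_version(get_latest_version, last_seen_version,
                         timeout_s=300, poll_interval_s=10):
    """Poll a table-version callable until it advances past last_seen_version.

    get_latest_version: zero-argument callable returning the current table
    version as an int (e.g. backed by a DESCRIBE HISTORY ... LIMIT 1 query).
    Returns the newly observed version, or raises TimeoutError if no new
    version appears within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = get_latest_version()
        if current > last_seen_version:
            return current
        time.sleep(poll_interval_s)
    raise TimeoutError(
        f"no version newer than {last_seen_version} after {timeout_s}s")
```

In a notebook, you would capture the Bronze table's version before triggering ingestion, then call this helper before running the Silver merge, instead of hard-coding a fixed sleep.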

Sources:
- [Docs: etl-quick-start](https://docs.databricks.com/getting-started/etl-quick-start.html)
- [Docs: index](https://docs.databricks.com/ingestion/index.html)
- [Docs: production](https://docs.databricks.com/ingestion/auto-loader/production.html)


2 REPLIES


adriennn
Contributor

Thanks @Kaniz. In a case where it's not possible or not practical to implement the pipeline with DLT, what would that "retry mechanism" be based on? I.e., is there an API other than the table history that can be leveraged to retry until "it works", or do you mean retrying until I see, e.g., a non-zero number of rows inserted into the downstream tables? Thanks!
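One way to ground the retry in the table history API mentioned in the original post is to treat "new data" as any Bronze commit newer than the last version the Silver merge processed, and retry until such a commit appears. A minimal sketch, under stated assumptions: it assumes the history rows are available as plain dicts (e.g. collected from a `DESCRIBE HISTORY` query), and that Autoloader's streaming writes show up as `STREAMING UPDATE` operations with a `numOutputRows` entry in `operationMetrics`, which is worth verifying on your DBR version.

```python
def new_appends_since(history, last_version):
    """Return history entries newer than last_version that actually wrote rows.

    history: list of dicts shaped like rows of DESCRIBE HISTORY, with keys
    "version", "operation", and "operationMetrics". A Silver upsert only
    makes sense once this list is non-empty.
    """
    return [
        h for h in history
        if h["version"] > last_version
        and h["operation"] == "STREAMING UPDATE"
        and int(h.get("operationMetrics", {}).get("numOutputRows", 0)) > 0
    ]
```

Retrying on this condition is stricter than retrying on a non-zero downstream row count, because it distinguishes genuine appends from maintenance commits such as OPTIMIZE.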
