Hi @adriennn, the behaviour you are seeing stems from the nature of distributed computing and the eventual-consistency model. When you use Delta Lake with Auto Loader, data is distributed across multiple nodes, and updates must propagate to all of them. Because of network latency and the time it takes for every node to catch up, there can be a delay before all nodes reflect the latest state of the data. This is not explicitly documented, but it is a well-known characteristic of distributed systems. In your case, that characteristic shows up as the delay between ingesting data into the Bronze table and that data becoming available for querying and further processing (such as merging into the Silver table).
To handle this, you may consider the following:
1. Introduce a delay or a retry mechanism in your notebook between the data ingestion and the query for the new data. This ensures all nodes have been updated with the latest data before you read it.
2. Use Delta Lake's built-in OPTIMIZE command to compact small files into larger ones, which speeds up queries. Be mindful, however, that OPTIMIZE is a resource-intensive operation and should be used judiciously.
3. When using Databricks Auto Loader with Delta Live Tables (DLT), you can leverage its built-in mechanisms for handling schema evolution and for monitoring via metrics in the event log.
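A minimal sketch of the retry idea in step 1. Here `fetch_new_rows` is a hypothetical placeholder for whatever query you run against the Bronze table; plug in your own, e.g. `wait_for_data(lambda: spark.sql("SELECT ...").collect())`:

```python
import time

def wait_for_data(fetch_new_rows, retries=5, delay_seconds=10):
    """Poll until the query returns rows, retrying with a fixed delay.

    fetch_new_rows: a zero-argument callable that returns the new rows,
    or an empty result while the data is not yet visible.
    """
    for attempt in range(retries):
        rows = fetch_new_rows()
        if rows:
            return rows
        time.sleep(delay_seconds)
    raise TimeoutError(f"No new data after {retries} attempts")
```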
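For step 2, a small helper that builds the OPTIMIZE statement so you can reuse it across tables; the table name and ZORDER column below are illustrative, and on Databricks you would pass the result to `spark.sql`:

```python
def optimize_statement(table_name, zorder_columns=None):
    """Build a Delta Lake OPTIMIZE statement, optionally with ZORDER BY."""
    statement = f"OPTIMIZE {table_name}"
    if zorder_columns:
        statement += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return statement

# On Databricks (hypothetical table and column names):
# spark.sql(optimize_statement("bronze.events", ["event_date"]))
```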
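And for step 3, a sketch of an Auto Loader read. The paths are hypothetical; the `cloudFiles` options shown are the standard ones for source format, schema tracking, and schema evolution (here `addNewColumns`, which adds new columns to the schema as they appear):

```python
def autoloader_options(schema_location, source_format="json"):
    """Return cloudFiles options for an Auto Loader stream that tracks
    its inferred schema and evolves it when new columns appear."""
    return {
        "cloudFiles.format": source_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
    }

# On Databricks (hypothetical paths):
# df = (spark.readStream.format("cloudFiles")
#         .options(**autoloader_options("/mnt/schemas/bronze"))
#         .load("/mnt/raw/events"))
```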
Please also note that the time at which a cell in a job finishes executing does not necessarily reflect what is happening in the background in storage.
Sources:
- [Docs: ETL quick start](https://docs.databricks.com/getting-started/etl-quick-start.html)
- [Docs: Data ingestion](https://docs.databricks.com/ingestion/index.html)
- [Docs: Auto Loader in production](https://docs.databricks.com/ingestion/auto-loader/production.html)