Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Live Tables - Adding streaming to existing table

Manzilla
New Contributor II

Currently, the bronze table ingests JSON files using the @dlt.table decorator on a spark.readStream function.

A daily batch job does some transformation on bronze data and stores results in the silver table.

New Process

Bronze stays the same.

A stream now ingests the bronze table into a view where the data transformation occurs; that view is the source for the silver table, which is updated with dlt.apply_changes.

dlt.apply_changes adds four hidden columns for tracking. My question is: what will happen when this runs for the first time against the production data?

Will the stream that feeds the silver process read the entire bronze table and reprocess it, or will it start from the current date/time and move forward?
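For reference, a minimal sketch of the pipeline described above. All table, view, path, key, and column names here are illustrative assumptions, not taken from the original post, and this code runs only inside a Databricks Delta Live Tables pipeline:

```python
# Hypothetical DLT pipeline sketch: bronze streaming ingest -> transforming
# view -> silver target updated with dlt.apply_changes.
# Names (paths, keys, columns) are placeholders.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Bronze: raw JSON ingested as a stream")
def bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/events/")  # illustrative source path
    )

@dlt.view(comment="Transformed bronze rows feeding the silver table")
def bronze_transformed():
    # Example transformation; replace with the real business logic.
    return dlt.read_stream("bronze").withColumn("amount", col("amount").cast("double"))

dlt.create_streaming_table("silver")

dlt.apply_changes(
    target="silver",
    source="bronze_transformed",
    keys=["id"],                  # illustrative primary key
    sequence_by=col("event_ts"),  # illustrative ordering column
)
```

The view sits between bronze and silver so the transformation logic stays separate from the CDC merge that apply_changes performs.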

2 REPLIES

Manzilla
New Contributor II

Thank you, that's what I understood too. It's nice to get validation from someone else who works with this.

Sidhant07
Databricks Employee

When you use `dlt.apply_changes` to update the silver table, it adds four hidden columns for tracking changes. These columns include `event_time`, `read_version`, `commit_version`, and `is_deleted`.

When you run this process for the first time against the production data, the stream that feeds the silver process will not reprocess the entire bronze table.

This is because stream processing in Delta Live Tables (DLT) is designed to process new data as it arrives, rather than reprocessing all the data each time. A checkpoint records the stream's progress and determines where processing resumes the next time the stream starts.

So, when you run the stream for the first time against the production data, it will start processing from the current date/time and move forward, using the checkpoint to keep track of its progress. It will not reprocess the entire bronze table unless you explicitly configure it to do so.
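The checkpoint mechanism can be illustrated with a small, Spark-free toy. This is not Spark or DLT code, just a sketch of the idea: the reader persists the last offset it processed, so each subsequent run picks up only the records added since the previous run instead of rescanning everything. All names here are invented for illustration:

```python
# Toy illustration of streaming checkpoints: a stored offset determines
# where the next run resumes, so restarts do not reprocess consumed data.
import json
import os
import tempfile

def read_new_records(source, checkpoint_path):
    """Return records added since the last run, then advance the checkpoint."""
    offset = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offset = json.load(f)["offset"]
    new = source[offset:]
    with open(checkpoint_path, "w") as f:
        json.dump({"offset": len(source)}, f)
    return new

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
table = ["r1", "r2", "r3"]

first = read_new_records(table, ckpt)   # no checkpoint yet: sees existing rows
table += ["r4", "r5"]
second = read_new_records(table, ckpt)  # resumes from the stored offset
print(first, second)  # ['r1', 'r2', 'r3'] ['r4', 'r5']
```

Note that in this toy, a run with no checkpoint starts from offset zero; in Spark you can control where a first run begins with options such as a starting version or timestamp on the source table.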
