Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Live table - Adding streaming to existing table

Manzilla
New Contributor II

Currently, the bronze table ingests JSON files using the @dlt.table decorator on a spark.readStream call.

A daily batch job does some transformation on the bronze data and stores the results in the silver table.

New Process

Bronze stays the same.

A stream has been created to ingest the bronze table into a view where the data transformation occurs; that view is then used as the source for the silver table, which is updated with dlt.apply_changes.
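For reference, the setup described above might look roughly like this. This is a minimal sketch, not the actual pipeline: the paths, table, view, and column names (`bronze_events`, `event_id`, `event_ts`, etc.) are all hypothetical.

```python
import dlt
from pyspark.sql import functions as F

# Bronze: ingest raw JSON files as a stream (hypothetical source path).
@dlt.table(name="bronze_events")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/events/")  # hypothetical landing location
    )

# View: stream from bronze and apply the transformations.
@dlt.view(name="bronze_events_transformed")
def bronze_events_transformed():
    return (
        dlt.read_stream("bronze_events")
        .withColumn("event_date", F.to_date("event_ts"))
    )

# Silver: streaming target table kept up to date via apply_changes.
dlt.create_streaming_table("silver_events")

dlt.apply_changes(
    target="silver_events",
    source="bronze_events_transformed",
    keys=["event_id"],              # hypothetical primary key
    sequence_by=F.col("event_ts"),  # hypothetical ordering column
)
```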

dlt.apply_changes adds 4 hidden columns for tracking. My question is: what will happen when this runs for the first time against the production data?

Will the stream associated with the silver process look at the entire bronze table and reprocess it, or will it start from the current date/time and move forward?

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Manzilla. When using Delta Live Tables’ dlt.apply_changes for change data capture (CDC), it’s essential to understand how it works.

Let’s break down the process and address your specific scenario:

  1. CDC with Delta Live Tables:

    • Delta Live Tables simplifies CDC using the APPLY CHANGES API. Unlike the traditional MERGE INTO statement, which can be error-prone due to out-of-sequence records, the APPLY CHANGES API handles out-of-sequence records automatically.
    • You specify a column in the source data that represents the proper ordering of records (usually a monotonically increasing value). Delta Live Tables uses this column to handle data that arrives out of order.
    • For SCD Type 2 changes (historical tracking), Delta Live Tables propagates sequencing values to the ...
  2. Your Scenario:

    • You’ve created a stream to ingest the bronze table into a view where data transformation occurs. This transformed view serves as the source for the silver table.
    • When you run dlt.apply_changes against the production data for the first time, here’s what happens:
      • The stream associated with the silver process will process changes based on the specified keys and sequencing.
      • It won’t reprocess the entire bronze table. Instead, it will start from the current date/time and move forward, capturing changes since the last checkpoint.
      • The hidden columns added by dlt.apply_changes (such as version columns) help track the changes and ensure correct processing.
In summary, your stream will process changes incrementally, starting from the current state, rather than reprocessing the entire bronze table. This approach ensures efficient and accurate CDC for your silver table. If you encounter any issues or need further assistance, feel free to ask! 😊🚀
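To illustrate point 1 above: the ordering column is supplied through the sequence_by parameter, and if you want SCD Type 2 history tracking, stored_as_scd_type=2 enables it (this is what populates the hidden __START_AT / __END_AT columns in the target). A minimal sketch with hypothetical table and column names:

```python
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("silver_customers")

dlt.apply_changes(
    target="silver_customers",
    source="bronze_customers_view",   # hypothetical transformed view
    keys=["customer_id"],             # key(s) used to match change records
    sequence_by=col("updated_at"),    # ordering column; lets DLT handle
                                      # out-of-sequence records automatically
    stored_as_scd_type=2,             # keep history; adds __START_AT / __END_AT
)
```

With stored_as_scd_type=1 (the default), the target keeps only the latest row per key instead of a history.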

 

Manzilla
New Contributor II

Thank you, that's what I understood too. It is just nice to get validation from someone else who works with this.
