
Data Lineage with Apply Changes

Gareema
New Contributor III

Hello Team

I am using DLT. I am able to see the lineage for the normal load process. However, as soon as I use the APPLY CHANGES feature, the lineage breaks and I am no longer able to see the data lineage in the catalog when I open the table.

Is there any way that I can use APPLY CHANGES and have the lineage retained?

Code to load into the silver table:

(code attached as a screenshot: Gareema_0-1720379993368.png)
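
For reference, here is a minimal sketch of what a typical apply_changes load into a silver table looks like (the table, key, and column names below are placeholders, not the exact code from the screenshot):

import dlt
from pyspark.sql.functions import col

# Bronze CDC feed used as the streaming source (placeholder names)
@dlt.view
def customers_cdc():
    return spark.readStream.table("bronze_schema.customers_cdc")

# Target streaming table in the silver schema
dlt.create_streaming_table("customers_silver")

# Apply the change feed, keeping the latest record per key
# based on the sequence column
dlt.apply_changes(
    target="customers_silver",
    source="customers_cdc",
    keys=["id"],
    sequence_by=col("sequence_key"),
    stored_as_scd_type=1
)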

 


Gareema
New Contributor III

One point to add here: both tables are in different schemas, so there are different DLT pipelines and notebooks for them.

Kaniz_Fatma
Community Manager

Hi @Gareema! I understand your concern about the data lineage breaking when using the APPLY CHANGES feature in DLT (Delta Live Tables). This is a common issue, but there are a few ways to address it and maintain the lineage even after applying changes.

Approach 1: Use the MERGE Statement

Instead of using the APPLY CHANGES feature, you can use the MERGE statement to update the target table. The MERGE statement allows you to update, insert, or delete rows in the target table based on the changes in the source table while preserving the data lineage.

Here's an example of how you can use the MERGE statement to load data into a silver table:

from delta.tables import DeltaTable

# Read the source data
source_df = spark.read.format("delta").load("path/to/source/table")

# Get the target table
target_table = DeltaTable.forPath(spark, "path/to/silver/table")

# Merge the source data into the target table
target_table.alias("target") \
  .merge(source_df.alias("source"), "target.id = source.id") \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()

Approach 2: Use the OVERWRITE Mode

Another option is to use 'OVERWRITE' mode instead of APPLY CHANGES. Overwrite mode replaces the entire table with the new data, which can help maintain the data lineage.

Here's an example of how you can use overwrite mode to load data into a silver table:
# Read the source data
source_df = spark.read.format("delta").load("path/to/source/table")

# Write the source data to the silver table in overwrite mode
source_df.write.format("delta").mode("overwrite").save("path/to/silver/table")
 
By using 'OVERWRITE' mode, you can ensure that the data lineage is preserved and still visible in the catalog after the update. Both of these approaches should help you maintain the data lineage as alternatives to the APPLY CHANGES feature.
 
If you have any further questions or need additional assistance, feel free to ask.

Gareema
New Contributor III

@Kaniz_Fatma Thank you for your response. However, in both approaches we are avoiding the APPLY CHANGES feature. In reality, I want to utilise that feature to keep only the latest record based on the sequence key.

With merge or overwrite I will not have the flexibility of keeping just the last updated row based on the sequence.

Gareema
New Contributor III

@Kaniz_Fatma: Is there any way this can be achieved, or can we expect this problem to be resolved in upcoming releases?
