Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to deal with deleted records from the source files in DLT.

ManojReddy
New Contributor II

Can the apply_changes feature deal with deleted records in incoming source files?
By delete I mean the record is being removed (not a soft delete with a flag).
If not, how can we automate deleting records from the Bronze streaming table based on the source files?

5 REPLIES

Kaniz_Fatma
Community Manager

Hi @ManojReddy, certainly! Let's address your questions regarding change data capture (CDC) and automating record deletions in a Bronze streaming table based on source files.

 

Change Data Capture (CDC) with Delta Live Tables:

  • CDC is a process that identifies and captures incremental changes (data deletes, inserts, and updates) in databases. It's commonly used for near-real-time data applications.
  • In the context of Delta Live Tables, you can use CDC to update tables based on changes in source data. This feature is supported in both the SQL and Python interfaces.
  • Delta Live Tables supports two types of slowly changing dimensions (SCD):
    • SCD Type 1: Updates records directly. History is not retained for updated records.
    • SCD Type 2: Retains a history of records, either on all updates or on updates to a specified set of columns.
  • To perform CDC processing with Delta Live Tables:
    • Create a streaming table.
    • Use an APPLY CHANGES INTO statement to specify the source, keys, and sequencing for the change feed.
    • The target streaming table should be created using CREATE OR REFRESH STREAMING TABLE (SQL) or create_streaming_table() (Python).
    • Define the CDC processing using the APPLY CHANGES statement (SQL) or apply_changes() function (Python).
    • For syntax details, refer to the documentation: Change data capture with SQL in Delta Live Tables or Change data capture with Python in Delta Live Tables.
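
To make the steps above concrete, here is a minimal Python sketch of the apply_changes() flow. The table and column names (customers_cdc, customer_id, event_ts) are hypothetical placeholders, not anything from this thread; adjust them to your own schema.

import dlt
from pyspark.sql.functions import col

# Hypothetical view over the incoming change feed (placeholder source table name).
@dlt.view
def customers_changes():
    return spark.readStream.table("bronze.customers_cdc")

# Target streaming table (SQL equivalent: CREATE OR REFRESH STREAMING TABLE customers).
dlt.create_streaming_table("customers")

# CDC processing (SQL equivalent: APPLY CHANGES INTO customers ...).
dlt.apply_changes(
    target = "customers",
    source = "customers_changes",
    keys = ["customer_id"],         # key column(s) used to match source records to the target
    sequence_by = col("event_ts"),  # ordering column for late or out-of-order events
    stored_as_scd_type = 1          # SCD Type 1: updated records do not keep history
)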

Automating Record Deletions in Bronze Streaming Table:

  • When dealing with record deletions, consider the following design considerations:
    • Metadata Capture: Include metadata info (e.g., ingestion timestamp, source file names) in the Bronze tables.
    • Tooling Considerations: Ensure the pipeline can read from sources and write into destination locations.
    • Reusability and Modular Design: Create nested parameterized pipelines to reduce development efforts.
    • Change History Maintenance:
      • Updating Changed Target Rows: Use upsert (merge) functionality to update changed rows. Requires primary keys for each table.
      • Appending Changed Source Rows: Retains change history but requires deduplication in subsequent phases.
      • Optionally, choose whether to truncate the target table.
  • The Data Mapping Flow in Delta Live Tables can address these needs effectively.
  • Manually deleting records from a streaming table is not recommended, as it can affect the pipeline. Instead, consider handling deletions through the pipeline logic.
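
As an illustration of the metadata-capture point above, here is a hedged sketch of a Bronze table that records the ingestion timestamp and source file name via Auto Loader. The landing path, format, and table name are placeholders, and the _metadata.file_path column assumes a recent Databricks Runtime.

import dlt
from pyspark.sql.functions import col, current_timestamp

@dlt.table(comment="Raw feed with ingestion metadata for later reconciliation")
def bronze_events():
    return (
        spark.readStream
            .format("cloudFiles")                    # Auto Loader
            .option("cloudFiles.format", "json")
            .load("s3://my-bucket/landing/")
            .withColumn("source_file", col("_metadata.file_path"))  # which file each row came from
            .withColumn("ingest_ts", current_timestamp())           # when it was ingested
    )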

Remember that the Bronze layer is crucial for capturing raw data changes, and thoughtful design ensures data integrity and efficient processing.

APPLY CHANGES can do upserts, but I have doubts about deleting records by key.

Can APPLY CHANGES delete records from the bronze streaming table if a key is no longer present in the source Delta files?

 

Hi Manoj,

No. APPLY CHANGES does not delete data from bronze if the key/data is no longer present in the source. It deletes based on the value of an incoming field, such as a status column (= "Delete").

If no status can be provided from the source, then you will need to execute the deletes again in each layer. Make sure that you set the skipChangeCommits flag to true so the streams ignore any deletes and updates. Streaming is append-only and hence does not expect any deletes or updates in the source. Link

A common example is clearing out old data from the source tables; you will need to do this for all layers. DLT will not do it automatically for you.
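
To illustrate both points, here is a hedged sketch: deletes driven by an incoming status column via apply_as_deletes, plus skipChangeCommits on a downstream read so out-of-band deletes do not break the stream. The table and column names (silver_customers, bronze_customers_changes, customer_id, event_ts, status) are placeholders.

import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table("silver_customers")

# Deletes are driven by an incoming status column, not by a key disappearing from the source.
dlt.apply_changes(
    target = "silver_customers",
    source = "bronze_customers_changes",           # placeholder source view/table
    keys = ["customer_id"],
    sequence_by = col("event_ts"),
    apply_as_deletes = expr("status = 'DELETE'"),  # delete when the feed flags the row
    except_column_list = ["status", "event_ts"],   # do not materialize bookkeeping columns
    stored_as_scd_type = 1
)

# If deletes happen out of band (no status column), downstream readers need skipChangeCommits
# so the stream ignores those update/delete commits; the deletes themselves must still be
# re-applied in each layer:
# spark.readStream.option("skipChangeCommits", "true").table("bronze.customers")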

2vinodhkumar
New Contributor II

Hi Manoj,

Did you get a solution or make a design change for this problem? We have 200K files in an S3 bucket, and when there is a change in the upstream app we get a new feed; the feed name is fixed. In DLT we should have only the new records from the replaced file, but the DLT table still contains the previously added records from the same file name. As we don't have any status indicator on the deleted records (these are events from upstream), we are unable to use apply_changes, and we can't do a full refresh either, because with almost 200K files a full refresh on every single file replacement takes too long.

Hi Vinodh,

It seems like DLT cannot handle this on its own.
I can think of a solution which goes like this.

1) Maintain a copy of the 200K files in a separate location (the copied path). DLT should point to this copied path.

2) If there is any change in an incoming file, run a process to insert the deleted records with a status indicator of "delete" and copy this file over to the copied path. To identify the deleted records, compare the new file with the file already in the copied path. Then you can use DLT apply_changes, because you now have a status indicator.

Basically, there should be a job that runs at a certain interval (e.g., every 10 minutes), tracks the changed files based on their last update date, compares each one with the existing file in the copied path, and inserts the deleted records with a status indicator.
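
A hedged sketch of that reconciliation job is below. The paths, file names, and the record_id key are placeholders: it compares the replaced source file with the previously copied version, re-emits the missing rows with status = 'DELETE', and writes the combined result to the copied path that the DLT pipeline reads.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Newly replaced file from the upstream feed, and the last processed copy of the same feed.
incoming = spark.read.json("s3://my-bucket/landing/feed.json")
previous = spark.read.json("s3://my-bucket/copied-path/feed.json")

# Rows that existed before but are missing now were deleted upstream.
deleted = (previous.join(incoming, on="record_id", how="left_anti")
                   .withColumn("status", lit("DELETE")))

# Keep the current rows as upserts and append the synthesized delete records.
merged = incoming.withColumn("status", lit("UPSERT")).unionByName(deleted)

# Write the result to the copied path that DLT points at, so apply_changes
# can process the deletes via the status indicator.
merged.write.mode("overwrite").json("s3://my-bucket/copied-path/feed_with_status/")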

 
