Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Best practice for handling data duplication in a Delta Live Tables pipeline with a daily batch data source

Chengcheng
New Contributor III

I am building a data pipeline with Delta Live Tables in Azure Databricks to move data from a raw data table into a feature table and a model inference results table. However, I am concerned about potential duplication issues during future operations, such as data recovery.

My data source is a relational database that is updated in daily batches: each day, the previous day's data is appended to each table. Since a JDBC source cannot be used as a streaming source, and we do not want to enable CDC in the database, I have built a job that ingests each day's new data into a staging Delta table and passes it incrementally to Delta Live Tables with the "ignoreChanges" option set to true, appending to a series of streaming tables every day.
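
For reference, the incremental read looks roughly like this (the staging table name and target table name are simplified placeholders):

    import dlt

    # Streaming read of the staging Delta table. "ignoreChanges" lets the
    # stream continue past upstream updates/deletes, but rows from rewritten
    # files can be re-delivered downstream -- hence the duplication concern
    # described below.
    @dlt.table(name="raw_events", comment="Incremental copy of the staging table")
    def raw_events():
        return (
            spark.readStream
            .option("ignoreChanges", "true")
            .table("staging.daily_ingest")
        )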

However, we sometimes get bad data, and the source will then restate data for a certain period (a range of dates). I am planning a recovery operation that deletes those dates from the raw data table and re-ingests the data from the database into the staging table. This will push the updated data to the downstream Delta Live tables, so I also need to delete those dates from all downstream tables. Since the pipeline will change over time, handling duplicates, or at least monitoring them with data quality expectations in Delta Live Tables, seems necessary.
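
Concretely, the planned recovery step would look something like the sketch below (table names and the date window are placeholders; for tables fully managed by DLT, I understand a pipeline full refresh is the usual alternative):

    # Delete the restated window from the raw and staging tables, then
    # re-run the ingestion job for the same window so the corrected rows
    # flow back through the pipeline. Names and dates are placeholders.
    bad_window = ("2023-05-01", "2023-05-07")  # inclusive date range to restate

    for tbl in ["raw_db.raw_table", "raw_db.staging_table"]:
        spark.sql(
            f"DELETE FROM {tbl} WHERE Date BETWEEN '{bad_window[0]}' AND '{bad_window[1]}'"
        )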

I have considered dropDuplicates(["Date", "User_ID"]), but I don't think it's useful in my case. I have also thought of three other ways to handle this:

  1. Handle duplicates in the DLT table definition, e.g., left-join the streaming data against the whole table itself on ["Date", "User_ID"] and drop rows that are already recorded in the streaming table. However, updated data would then be dropped without notification, the join requires a lot of compute, and I don't actually need to check at the User_ID level.
  2. Define a data quality rule in DLT that warns or drops rows when data for the same date already exists in the table. However, I am unsure how to do this, or whether it is possible (see the first sketch after this list).
  3. Create a monitoring table or view that tracks row counts per date; if the count doubles for some date, it notifies us so we can intervene by hand (see the second sketch after this list).
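
For option 2, since DLT expectations are row-level predicates, the closest I can come up with is to materialize per-date row counts and attach a warning expectation to them. The upstream table "features" and the 100000 ceiling are placeholders to adjust:

    import dlt
    from pyspark.sql import functions as F

    # Per-date row counts with a warning-level expectation attached.
    # "features" and the 100000 ceiling are assumptions -- tune to your data.
    @dlt.table(name="rows_per_date", comment="Per-date row counts for duplicate monitoring")
    @dlt.expect("row_count_within_bounds", "n_rows <= 100000")  # logs a warning metric, keeps the rows
    def rows_per_date():
        return (
            dlt.read("features")  # batch read of a table defined in the same pipeline
            .groupBy("Date")
            .agg(F.count("*").alias("n_rows"))
        )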
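
And for option 3, a monitoring view might look roughly like this, flagging dates whose row count is at least double the trailing seven-day average (the 2x rule and all names are placeholders):

    import dlt
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # View listing dates whose row count jumps to at least 2x the trailing
    # 7-day average -- a candidate list for manual inspection.
    @dlt.view(name="suspect_dates")
    def suspect_dates():
        daily = dlt.read("features").groupBy("Date").agg(F.count("*").alias("n_rows"))
        trailing = Window.orderBy("Date").rowsBetween(-7, -1)
        return (
            daily
            .withColumn("trailing_avg", F.avg("n_rows").over(trailing))
            .where(F.col("n_rows") >= 2 * F.col("trailing_avg"))
        )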

I would like to know if there is a better way to handle this than the third method.

Please let me know if my description is unclear or if you have any questions.

Thank you for taking the time to read my long question, and any suggestions would be greatly appreciated.

1 REPLY

Anonymous
Not applicable

Hi @Chengcheng Guo

Great to meet you, and thanks for your question!

Let's see if your peers in the community have an answer to your question. Thanks.
