I am building a data pipeline using Delta Live table in Azure Databricks to move data from a raw data table to a feature table and model inference results table. However, I am concerned about the potential for duplication issues in future operations, such as data recovery.
My data source is a relational database that is updated daily in batches. Data from the previous day is appended to each table in the database. Since the JDBC source cannot be used as a streaming source, and we do not want to enable CDC in the database, I have built a job that ingests daily new data into a staging Delta table and passes it incrementally to the Delta Live table with the "ignoreChanges" option set to True to create a series of streaming tables every day.
However, sometimes we have bad data and the data source will update some data within a certain period (date). I am planning an operation to delete data from these dates in the raw data table and re-ingest data from the database to the staging table. This will pass updated data to downstream Delta Live tables, so I also need to delete the data from these dates in all downstream tables. Since the data pipeline will change over time, handling duplicates or monitoring duplicates using Data Quality definitions in the Delta Live table seems necessary.
I have considered dropDuplicates(["Date", "User_ID"]), but I don't think it's useful in my case. I have also thought of three other ways to handle this:
- Handling duplications in the definition of DLT table. Such as left join streaming data with the whole table itself using columns ["Date", "User_ID"], and dropping those data that have already been recorded in the streaming table. However, updated data will be dropped without notification and this method requires a lot of computing power. Additionally, I don't need to check on a User_ID level.
- Define a data quality rule in DLT that will warn or drop data if there is already existing data with the same date in the tables. However, I am unsure of how to do this or if it is possible.
- Create a monitoring table or view to monitor the number of counts on each date. If the counts double on some date, it will notify us to operate by hand.
I would like to know if there is a better way to handle this than the third method.
Please let me know if my description is unclear or if you have any questions.
Thank you for taking the time to read my long question, and any suggestions would be greatly appreciated.