@Jordan Fox:
If you're getting an error about upstream changes, it might be because the table schema or partitioning has changed. You can try running DESCRIBE EXTENDED logs and DESCRIBE EXTENDED dedupedLogs to compare the schemas and see if there are any differences.
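For example, to inspect both schemas (and, since these are Delta tables, DESCRIBE DETAIL will also report a partitionColumns field, which makes partitioning changes easy to spot):

DESCRIBE EXTENDED logs;
DESCRIBE EXTENDED dedupedLogs;

-- Delta-specific: returns format, location, partitionColumns, etc.
DESCRIBE DETAIL logs;
DESCRIBE DETAIL dedupedLogs;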
Yes, it is possible to perform a rolling-window de-duplication in Delta Lake using the merge operation. You can merge the incremental data with the existing data in the Delta table and update or insert records based on a condition. Here's an example:
MERGE INTO logs
USING (
  -- Keep only the most recent record per id from the last 7 days
  SELECT id, filename, date
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date DESC) AS rownum
    FROM dedupedLogs
    WHERE date > current_date() - INTERVAL 7 DAYS
  ) ranked
  WHERE rownum = 1
) AS d
-- Only rows of logs inside the 7-day window are eligible to match
ON logs.id = d.id AND logs.date > current_date() - INTERVAL 7 DAYS
WHEN MATCHED THEN
  UPDATE SET logs.filename = d.filename
WHEN NOT MATCHED THEN
  INSERT (id, filename, date) VALUES (d.id, d.filename, d.date);
In this example, dedupedLogs is the table that contains the de-duplicated data for the past 7 days. The ROW_NUMBER() window function assigns a row number to each record within a group of records sharing the same id, ordered by date descending; keeping only the rows with rownum = 1 gives the most recent record per id. Deduplicating the source like this also matters because a Delta MERGE fails if more than one source row matches the same target row. The MERGE INTO statement then matches records in logs with records in d on the id column, and the date predicate in the ON clause restricts matching to rows of logs from the last 7 days, which is what makes the window rolling: older rows in logs are left untouched. When a match is found, the filename column in logs is updated with the value from d; when there is no match, a new record is inserted into logs with the values from d.
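If you want to sanity-check the result afterwards, one quick way (a sketch, assuming id is meant to be unique within the window) is to look for ids that still appear more than once in the last 7 days:

-- Any rows returned here indicate duplicates surviving in the window
SELECT id, COUNT(*) AS cnt
FROM logs
WHERE date > current_date() - INTERVAL 7 DAYS
GROUP BY id
HAVING COUNT(*) > 1;

Note this only checks the window itself; the MERGE above upserts against logs but does not remove duplicates that already existed there before.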