DLT Potential Bug: File Reprocessing Issue with "cloudFiles.allowOverwrites": "true"
10-29-2024 06:48 PM
Hi there, I ran into a peculiar case and I'm wondering if anyone else has run into this and can offer an explanation. We have a DLT process that pulls CSV files from a landing location and inserts (appends) them into target tables. We have the setting "cloudFiles.allowOverwrites": "true" because it's possible (likely, even) that a file will first arrive either empty or partially filled, and the same file can be overwritten with more complete data later. We are ok with data duplication (de-duplication is handled downstream, after DLT), but we are *not* ok with an updated file being skipped in a later DLT insert. Yet this is verifiably what happens every now and then (roughly 2% of the time).
For additional context, say we have 10 daily files, each anywhere from a few thousand to a few million records. These files arrive and are initially inserted by the DLT process, and we then expect (and need) each subsequent update to a file, say 3-4 per day, to be re-inserted into the target tables by DLT as well. Once a file is inserted or re-inserted, a separate non-DLT process is kicked off to de-duplicate the data. All of this runs either as a scheduled workflow (3-4 times a day) or as a manual run.
While it is not the norm, a manual run and a scheduled run can end up running very close to, or on top of, each other. However, updated files fail to be re-inserted roughly 2% of the time, and I'm not sure this edge case alone would explain a failure rate that high.
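To make the setup concrete, here is a minimal sketch of the kind of pipeline described above, plus a small helper for spotting updates that were skipped. It is illustrative only, not our exact code: the function names, the "header" option, and the idea of carrying the source file path and modification time through to the target table (via the `_metadata` column) are assumptions made for the example.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# Auto Loader options; only cloudFiles.allowOverwrites comes from the post.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "csv",
    "cloudFiles.allowOverwrites": "true",
    "header": "true",  # assumption: the CSV files have a header row
}


def read_landing_files(spark, landing_path: str):
    """Stream CSV files from the landing location with Auto Loader.

    Intended to be called from inside a DLT table definition, e.g.

        import dlt

        @dlt.table(name="raw_daily_files")  # hypothetical table name
        def raw_daily_files():
            return read_landing_files(spark, "<landing path>")

    The two _metadata fields are kept so each row records which version
    of which file it came from (an assumption, used by the check below).
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**AUTOLOADER_OPTIONS)
        .load(landing_path)
        .select("*", "_metadata.file_path", "_metadata.file_modification_time")
    )


# A file version is identified by (path, modification time).
FileVersion = Tuple[str, datetime]


def find_skipped_versions(
    landing: Iterable[FileVersion], ingested: Iterable[FileVersion]
) -> List[FileVersion]:
    """Return landing file versions that never appear in the target table."""
    ingested_set = set(ingested)
    return sorted(v for v in set(landing) if v not in ingested_set)
```

The helper would be fed the current landing directory listing on one side and the distinct (file_path, file_modification_time) pairs from the target table on the other. It can only see the latest version of each file, so an intermediate overwrite that was itself replaced before the next run would not show up, and it assumes both sides report timestamps at the same precision.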
Is there any known issue with the DLT process when "cloudFiles.allowOverwrites": "true" is set?
Does this line up with any other similar reported issues with DLT?
Any feedback on this issue/bug would be much appreciated!