10-29-2024 06:48 PM
Hi there, I ran into a peculiar case and I'm wondering if anyone else has run into this and can offer an explanation. We have a DLT process that pulls CSV files from a landing location and inserts (appends) them into target tables. We have the setting
"cloudFiles.allowOverwrites": "true"
because it's possible (likely, even) that a file will first arrive empty or partially filled, and the same file can be overwritten with more complete data later. We are OK with data duplication (de-duplication is handled downstream, after DLT), but we are *not* OK with a file being updated and then skipped by a later DLT insert. Yet this verifiably happens every now and then (roughly ~2% of the time).
For additional context, say we have 10 daily files (sizes range anywhere from a few thousand to a few million records per file). These files arrive and are inserted via the DLT process initially, but we expect/need each subsequent file update (say 3-4 times a day) to be re-inserted into the target tables via the DLT process. Once a file is inserted (or re-inserted), a separate non-DLT process kicks off to de-duplicate the data; this runs either on a scheduled workflow (3-4 times a day) or as a manual run. While not likely or the norm, there is the potential for a manual run and a scheduled run to overlap or run very close together. However, we are seeing updated files fail to be re-inserted via DLT roughly ~2% of the time, and I'm not sure that this edge case would explain such a high failure rate.
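For reference, the ingestion side looks roughly like this (table name and landing path are simplified placeholders, not our real ones):

import dlt

@dlt.table(name="raw_daily_files")
def raw_daily_files():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.allowOverwrites", "true")  # re-ingest files that are overwritten in place
        .option("header", "true")
        .load("/mnt/landing/daily/")  # placeholder landing location
    )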
Is there any known issue with the DLT process when
"cloudFiles.allowOverwrites": "true"
is set?
Does this line up with any other similar issues reported against DLT?
Any feedback on this issue/bug would be much appreciated!
10-31-2024 09:57 AM
This combination of Auto Loader caching, file deduplication, and potential concurrency overlap could be behind the ~2% miss rate you’re seeing. Checking and adjusting these areas should help improve re-insertion consistency.
If none of the above is feasible, I would suggest reaching out to our support team for a deeper use-case evaluation to consider all possible options, as there are many other aspects to consider here, e.g. use of Photon, the DBR release version, whether the input files are mutated, the file update frequency, etc.
10-31-2024 10:02 AM
Great feedback!
Can you please provide a bit more context or an example of which DLT logs to monitor? I tried looking into the logs, but I'm likely not digging in the right place, and the logs I did find were completely overwhelming to dig through.
10-31-2024 10:10 AM
In the pipeline side panel, go to "Update Details" (on the right-hand side); the Spark logs can be seen there.
10-31-2024 10:13 AM
At the bottom of the side panel you can also see another "View logs" option, which gives you the DLT pipeline event log details. Click on it and a pop-up will appear.
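If you would rather query them programmatically, the DLT event log is also stored as a Delta table under the pipeline's storage location. A minimal sketch (the path below is a placeholder; substitute your pipeline's configured storage path):

# Sketch: read the DLT event log as a Delta table and look at flow progress.
# The storage path is a placeholder for your pipeline's configured location.
event_log = spark.read.format("delta").load("/pipelines/<pipeline-id>/system/events")
(event_log
    .filter("event_type = 'flow_progress'")
    .select("timestamp", "event_type", "details")
    .orderBy("timestamp", ascending=False)
    .show(truncate=False))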
10-31-2024 10:17 AM
Meanwhile, on digging further:
Auto Loader is generally recommended for ingestion where the files are immutable. While there are configurations (i.e. cloudFiles.allowOverwrites = true) that allow updated files in the source to be re-ingested, Auto Loader guarantees exactly-once semantics only when allowOverwrites is not enabled.
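To see what Auto Loader has actually recorded, you can also query the stream's checkpoint with the cloud_files_state SQL function (available in recent DBR versions). A sketch; the checkpoint path is a placeholder, and for DLT it lives under the pipeline's storage location:

# Sketch: list the files Auto Loader has discovered/committed for a stream.
# Replace the placeholder with your stream's actual checkpoint path.
spark.sql("SELECT * FROM cloud_files_state('<path-to-checkpoint>')").show(truncate=False)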
Can you please help me with two details:
- Which DBR are you on? (Please try with the latest.)
- Are you using Photon?
11-01-2024 07:51 AM
Hi.. I've tried to respond 3 times already, but there seems to be an issue with DBX Community. Each time my post shows as successful, and when I refresh the page it looks fine... but then I check back later (e.g. 30 mins later) and my post is GONE! ...
I have more context, but for now to answer your direct questions:
11-01-2024 10:25 AM
Apologies, that could be an internet or networking issue.
So, in DLT you are able to change the DBR, but you will have to use a custom image, which may be tricky if you have not done it before. By default, Photon is used in serverless.
It may be a stretch, but can you try the workload on an interactive cluster with DBR 15.3+?
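If it helps, here is a minimal sketch of replaying the same Auto Loader read outside of DLT on an interactive cluster, to rule out pipeline-specific behavior (paths and the target table name are placeholders, not your actual setup):

# Sketch: reproduce the cloudFiles ingestion outside DLT.
# Paths and table name are placeholders.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.allowOverwrites", "true")
    .option("header", "true")
    .load("/mnt/landing/daily/")
    .writeStream
    .option("checkpointLocation", "/tmp/reingest_test/_checkpoint")
    .trigger(availableNow=True)
    .toTable("reingest_test"))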
Or, add these two configs in the DLT Advanced configuration as a workaround (if Photon is involved):
spark.databricks.photon.scan.enabled false
spark.databricks.photon.jsonScan.enabled false
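If you edit the pipeline's JSON settings directly, these land under the configuration object, roughly like this:

"configuration": {
    "spark.databricks.photon.scan.enabled": "false",
    "spark.databricks.photon.jsonScan.enabled": "false"
}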
Photon: yes; not sure which DBR.