10-29-2024 06:48 PM
Hi there, I ran into a peculiar case and I'm wondering if anyone else has run into this and can offer an explanation. We have a DLT process that pulls CSV files from a landing location and inserts (appends) them into target tables. We have the setting
"cloudFiles.allowOverwrites": "true"
because it's possible (likely, even) that a file will first arrive empty or partially filled, and the same file can be overwritten with more complete data later. We are OK with data duplication (de-duplication is handled downstream, after DLT), but we are *not* OK with a file being updated and then skipped by a later DLT insert. Yet this verifiably happens every now and then (roughly ~2% of the time).
For additional context, say we have 10 daily files (sizes range anywhere from a few thousand to a few million records per file). These files arrive and are inserted via the DLT process initially, but we expect/need each subsequent file update (say 3-4 times a day) to be re-inserted into the target tables via the DLT process. Once a file is inserted (or re-inserted), a separate non-DLT process kicks off to de-duplicate the data; this runs either on a scheduled workflow (3-4 times a day) or as a manual run. While not likely or the norm, there is the potential for a manual run and a scheduled run to overlap or run very close together. However, we are seeing updated files fail to be re-inserted via DLT roughly ~2% of the time, and I'm not sure that this edge case would explain such a high failure rate.
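For reference, the ingestion side looks roughly like this (table name and landing path are simplified placeholders, not our real ones):

import dlt

@dlt.table(name="raw_daily_files")
def raw_daily_files():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.allowOverwrites", "true")  # re-ingest files that are overwritten in place
        .option("header", "true")
        .load("/mnt/landing/daily/")  # placeholder landing location
    )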
Is there any known issue with the DLT process when
"cloudFiles.allowOverwrites": "true"
is set?
Does this line up with any other similar issues reported against DLT?
Any feedback on this issue/bug would be much appreciated!
10-31-2024 09:57 AM
This combination of Auto Loader caching, file deduplication, and potential concurrency overlap could be behind the ~2% miss rate you’re seeing. Checking and adjusting these areas should help improve re-insertion consistency.
If none of the above is feasible, I would suggest reaching out to our support team for a deeper use-case evaluation to consider all possible options, as there are many other aspects to consider here, e.g. use of Photon, the DBR release version, whether the input files are mutated, the file update frequency, etc.
10-31-2024 10:02 AM
Great feedback!
Can you please provide a bit more context or an example of which DLT logs to monitor? I tried looking into the logs, but I'm likely not digging in the right place, and the logs I did find were completely overwhelming to dig through.
10-31-2024 10:10 AM
In the pipeline side panel, go to "Update Details" (on the right-hand side); the Spark logs can be seen there.
10-31-2024 10:13 AM
At the bottom of the side panel you can also see another "View logs" option, which gives you the DLT pipeline event log details. Click on it and a pop-up will appear.
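If you would rather query them programmatically, the DLT event log is also stored as a Delta table under the pipeline's storage location. A minimal sketch (the path below is a placeholder; substitute your pipeline's configured storage path):

# Sketch: read the DLT event log as a Delta table and look at flow progress.
# The storage path is a placeholder for your pipeline's configured location.
event_log = spark.read.format("delta").load("/pipelines/<pipeline-id>/system/events")
(event_log
    .filter("event_type = 'flow_progress'")
    .select("timestamp", "event_type", "details")
    .orderBy("timestamp", ascending=False)
    .show(truncate=False))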
10-31-2024 10:17 AM
Meanwhile, on digging further:
Auto Loader is generally recommended for ingestion where the files are immutable. While there are configurations (i.e. cloudFiles.allowOverwrites = true) that allow updated files in the source to be re-ingested, Auto Loader guarantees exactly-once semantics only when allowOverwrites is not enabled.
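To see what Auto Loader has actually recorded, you can also query the stream's checkpoint with the cloud_files_state SQL function (available in recent DBR versions). A sketch; the checkpoint path is a placeholder, and for DLT it lives under the pipeline's storage location:

# Sketch: list the files Auto Loader has discovered/committed for a stream.
# Replace the placeholder with your stream's actual checkpoint path.
spark.sql("SELECT * FROM cloud_files_state('<path-to-checkpoint>')").show(truncate=False)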
Can you please help me with two details:
- Which DBR are you on? (Please try with the latest.)
- Are you using Photon?
11-01-2024 07:51 AM
Hi.. I've tried to respond 3 times already, but there seems to be an issue with DBX Community. Each time my post shows as successful, and when I refresh the page it looks fine... but then I check back later (e.g. 30 mins later) and my post is GONE! ...
I have more context, but for now to answer your direct questions:
11-01-2024 10:25 AM
Apologies, that could be an internet or networking issue.
So, in DLT you are able to change the DBR, but you will have to use a custom image, which may be tricky if you have not done it before. By default, Photon is used in serverless.
It may be a stretch, but can you try the workload on an interactive cluster with DBR 15.3+?
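If it helps, here is a minimal sketch of replaying the same Auto Loader read outside of DLT on an interactive cluster, to rule out pipeline-specific behavior (paths and the target table name are placeholders, not your actual setup):

# Sketch: reproduce the cloudFiles ingestion outside DLT.
# Paths and table name are placeholders.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.allowOverwrites", "true")
    .option("header", "true")
    .load("/mnt/landing/daily/")
    .writeStream
    .option("checkpointLocation", "/tmp/reingest_test/_checkpoint")
    .trigger(availableNow=True)
    .toTable("reingest_test"))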
Or, add these two configs in the DLT Advanced configuration as a workaround (if Photon is involved):
spark.databricks.photon.scan.enabled false
spark.databricks.photon.jsonScan.enabled false
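If you edit the pipeline's JSON settings directly, these land under the configuration object, roughly like this:

"configuration": {
    "spark.databricks.photon.scan.enabled": "false",
    "spark.databricks.photon.jsonScan.enabled": "false"
}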
Photon: yes; not sure which DBR.