01-09-2024 01:09 AM
Hi @Retired_mod
thank you for spending time on this :).
The DROPMALFORMED option will silently ignore and drop malformed lines, which is not really an applicable option in a pipeline, at least not in all the use cases I have worked in so far.
A potential solution could be to use badRecordsPath (Handle bad records and files | Databricks on AWS), which, again, processes all the files and silently writes the bad records out as JSON files.
I'm not sure it works with COPY INTO, but the point is that I want a *single* file to fully fail if a malformed line is found, so a flow to achieve this would have to
- check for the existence of the bad records files (written using a timestamp as name...).
- get the filename of the offending line(s)
- revert the writing of that single file (which is not possible with simple commands)
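The first two steps of that flow could be sketched roughly as below, in plain Python. This is only an illustration under assumptions: it assumes the bad records directory is reachable as a local path, that each bad-record file holds one JSON object per line, and that each object carries the source file under a `path` key. The function name `offending_files` is made up for the example. The third step (reverting the load) is not covered.

```python
import json
from pathlib import Path


def offending_files(bad_records_dir):
    """Collect the source file paths mentioned in bad-record files.

    Assumption: every bad-record file is JSON lines, and each record
    has a "path" key naming the input file the bad line came from.
    """
    paths = set()
    for file in Path(bad_records_dir).rglob("*"):
        # Skip directories and hidden/marker files (e.g. checksum files).
        if not file.is_file() or file.name.startswith((".", "_")):
            continue
        for line in file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if "path" in record:
                paths.add(record["path"])
    return sorted(paths)
```

An empty result would mean no bad records were written; a non-empty one gives the list of input files to investigate.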
One could also use the additional column "rescuedDataColumn" (Read and write to CSV files | Databricks on AWS) (again, I have to check whether it works with COPY INTO), but it wouldn't solve everything, because you still cannot revert the load of the faulty file when this happens.
You may ask "why do you want a full file to fail in case of a single error?", which is what is triggering so many issues and questions here. The answer is that in my case
- wrong files are rare and errors should be investigated (and reported). The safest thing is to consider a faulty file unreliable as a whole
- an error like a newline in a field will likely be flagged as 1 error line and not 2. The first portion of the line is "usually" considered correct (with the rest of the line filled with Nones) and loaded into the final table, whereas the second part is considered faulty. It's not possible to connect the two, and one should ignore the first part as well.
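To make the newline case concrete, here is a small plain-Python illustration. It is not Spark's parser: `split_into_rows` is a hypothetical, deliberately naive line-based reader that pads short rows with None, just to show how one logical record turns into two unrelated physical rows.

```python
def split_into_rows(text, expected_columns, delimiter=","):
    """Naive line-based parse: split on newlines, pad short rows with None."""
    parsed = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        padded = fields + [None] * (expected_columns - len(fields))
        parsed.append(padded[:expected_columns])
    return parsed


# One logical record of 4 fields, with an unquoted newline inside the third field.
raw = "1,Alice,first line\nsecond line,2024-01-09\n"
rows = split_into_rows(raw, expected_columns=4)

# The record is now two rows; nothing in either row points to the other.
first_half, second_half = rows
```

Here `first_half` starts with a plausible id and name, so it can pass as a valid row with missing trailing values, while `second_half` starts with free text and is the one that looks broken.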
All the issues above can be solved using algos that read the files more than once (e.g. a file check first and then the write), and/or by adding many layers of complexity to the COPY INTO command... or with the implementation of an "ignoreMalformedFiles" option with a meaningful, easy-to-read log. That's why I was looking for a solution like this.
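As a rough idea of the "file check and then writing" approach, here is a minimal sketch in plain Python, not an existing Spark or COPY INTO feature. It assumes the only check needed is the field count per row, and the names `MalformedFileError`, `validate_csv_text` and `split_good_and_bad` are invented for the example. Each file is validated as a whole: one bad row rejects the entire file and records a reason for the log.

```python
import csv
import io


class MalformedFileError(Exception):
    """Raised when a file contains at least one malformed row."""


def validate_csv_text(text, expected_columns, delimiter=","):
    """Return all parsed rows, or raise on the first row with a wrong field count."""
    rows = []
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    for row_number, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            raise MalformedFileError(
                f"row {row_number}: expected {expected_columns} fields, got {len(row)}"
            )
        rows.append(row)
    return rows


def split_good_and_bad(files, expected_columns):
    """files maps a file name to its text content.

    Returns (good, bad): good maps name -> parsed rows (safe to load),
    bad maps name -> reason (to be reported and investigated).
    """
    good, bad = {}, {}
    for name, text in files.items():
        try:
            good[name] = validate_csv_text(text, expected_columns)
        except MalformedFileError as exc:
            bad[name] = str(exc)
    return good, bad
```

Only the files in `good` would then be passed on to the actual load, at the cost of reading every file twice.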