<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: spark.read.parquet() - how to check for file lock before reading? (azure) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32330#M23555</link>
    <description>&lt;P&gt;That's the problem - it's not being locked (or fs.mv() isn't checking/honoring the lock). The upload process/tool is a third-party external tool.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I can see via the upload tool that the file upload is 'in progress'.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I can also see the 0-byte destination file in the adlsv2 container (while it's being uploaded).&lt;/P&gt;</description>
    <pubDate>Fri, 09 Sep 2022 02:33:57 GMT</pubDate>
    <dc:creator>jakubk</dc:creator>
    <dc:date>2022-09-09T02:33:57Z</dc:date>
    <item>
      <title>spark.read.parquet() - how to check for file lock before reading? (azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32328#M23553</link>
      <description>&lt;P&gt;I have some Python code which takes parquet files from an adlsv2 location and merges them into Delta tables (run as a workflow job on a schedule).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a try/catch wrapper around this so that any files that fail get moved into a failed folder using dbutils.fs.mv, while the files that get processed are archived off to a different location.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;One scenario I've encountered is this:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;- an external upload process is uploading somefile.parquet to adlsv2&lt;/P&gt;&lt;P&gt;- the workflow job starts&lt;/P&gt;&lt;P&gt;- spark.read.parquet() fails with: Caused by: java.io.IOException: Could not read footer for file:&lt;/P&gt;&lt;P&gt;- dbutils.fs.mv moves the file (boo)&lt;/P&gt;&lt;P&gt;- the external process fails because mv has deleted the target while the upload is in progress&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'd assumed that mv would fail because there would be an exclusive lock on the file while it's being uploaded, but that's not the case (??)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any suggestions on how to handle this?&lt;/P&gt;&lt;P&gt;Is there a way for me to check if a file is locked/being written to?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What's the error/exception to catch for this? I've spent hours trying to figure it out, but the generic Python exceptions don't cover it and I get a NameError for the specific Spark ones I try.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 04:52:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32328#M23553</guid>
      <dc:creator>jakubk</dc:creator>
      <dc:date>2022-09-08T04:52:25Z</dc:date>
    </item>
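The failure mode in the question (reading somefile.parquet while a third-party tool is still uploading it) can be mitigated with a size-stability probe, since ADLS Gen2 exposes no POSIX-style lock that a reader can test for. The sketch below is an illustrative assumption, not behavior documented by Databricks: it treats a file as in-progress if it is zero bytes (the placeholder the poster observed) or if its size changes during a short settle window. It uses local `os.path.getsize()` as a stand-in; on Databricks you would take sizes from a `dbutils.fs.ls()` listing instead (a hypothetical adaptation not shown here).

```python
import os
import time

def looks_complete(path: str, settle_seconds: float = 2.0) -> bool:
    """Heuristic check that a file has finished uploading before reading it.

    Returns False for a 0-byte file (the placeholder visible during an
    in-progress ADLS upload) or for a file whose size changes during the
    settle window; otherwise returns True. This is a best-effort probe,
    not a lock: a slow uploader that pauses longer than settle_seconds
    can still slip through.
    """
    size_before = os.path.getsize(path)
    if size_before == 0:  # 0-byte destination file seen while upload is in progress
        return False
    time.sleep(settle_seconds)
    return os.path.getsize(path) == size_before
```

Files that fail this probe can simply be skipped and retried on the next scheduled run, rather than read (and then mistakenly moved to the failed folder) while the upload is still in flight.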
    <item>
      <title>Re: spark.read.parquet() - how to check for file lock before reading? (azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32329#M23554</link>
      <description>&lt;P&gt;Do you have any idea how the file would be locked? Because it should not be locked (unless the file is actually still being written, i.e. not finished yet).&lt;/P&gt;</description>
      <pubDate>Thu, 08 Sep 2022 09:55:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32329#M23554</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-08T09:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: spark.read.parquet() - how to check for file lock before reading? (azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32330#M23555</link>
      <description>&lt;P&gt;That's the problem - it's not being locked (or fs.mv() isn't checking/honoring the lock). The upload process/tool is a third-party external tool.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I can see via the upload tool that the file upload is 'in progress'.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I can also see the 0-byte destination file in the adlsv2 container (while it's being uploaded).&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 02:33:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-read-parquet-how-to-check-for-file-lock-before-reading/m-p/32330#M23555</guid>
      <dc:creator>jakubk</dc:creator>
      <dc:date>2022-09-09T02:33:57Z</dc:date>
    </item>
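On the poster's other question (what exception to catch): `spark.read.parquet()` typically surfaces JVM-side failures as a generic Python exception (often `py4j.protocol.Py4JJavaError`), so there is usually no narrow Python class that isolates the "Could not read footer" case, which would explain the NameError the poster hit when guessing at specific Spark exception names. A pragmatic workaround is to catch broadly and classify by message text before deciding whether to `dbutils.fs.mv` the file to the failed folder. The marker strings below are assumptions: "Could not read footer" comes from the stack trace quoted in the thread, while "is not a Parquet file" is a hypothetical second marker, not confirmed by the thread.

```python
# Markers that suggest the parquet file was incomplete or corrupt at read
# time. "Could not read footer" is taken from the thread's stack trace;
# "is not a Parquet file" is an assumed additional marker.
_INCOMPLETE_MARKERS = ("Could not read footer", "is not a Parquet file")

def is_incomplete_file_error(exc: Exception) -> bool:
    """Return True if the exception message looks like a partial-upload
    parquet read failure, so the caller can skip the file (and retry
    later) instead of moving a still-uploading file to a failed folder.
    """
    message = str(exc)
    return any(marker in message for marker in _INCOMPLETE_MARKERS)
```

In the workflow's try/except, this would sit inside a broad `except Exception as e:` block: if `is_incomplete_file_error(e)` is true, leave the file in place for the next run; otherwise fall through to the existing move-to-failed-folder logic.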
  </channel>
</rss>

