spark.read.parquet() - how to check for file lock before reading? (azure)
09-07-2022 09:52 PM
I have some Python code that takes parquet files from an ADLS Gen2 location and merges them into Delta tables (run as a workflow job on a schedule).
I have a try/except wrapper around this so that any files that fail get moved into a failed folder using dbutils.fs.mv, while the files that get processed are archived off to a different location.
One scenario I've encountered is this:
- an external upload process is uploading somefile.parquet to ADLS Gen2
- the workflow job starts
- spark.read.parquet() fails with - Caused by: java.io.IOException: Could not read footer for file:
- dbutils.fs.mv moves the file (boo)
- the external process fails because mv has deleted the target while the upload is in progress
I'd assumed that mv would fail because there would be an exclusive lock on the file while it's being uploaded, but that's not the case (??)
Any suggestions on how to handle this?
Is there a way for me to check if a file is locked/being written to?
What's the exception to catch for this error? I've spent an hour or more trying to figure it out, but the generic Python ones don't cover it and I get a NameError for the specific Spark ones I try.
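For the exception question, here is a minimal sketch rather than a confirmed answer. A NameError in Python usually means the exception class was never imported, not that it can't be caught. One way to avoid guessing the class is to catch `Exception` and look for the footer message in the error text. The helper names `is_footer_read_error` and `classify_failure` are hypothetical. It is also an assumption that the Java "Caused by" text shows up in `str()` of the exception PySpark raises.

```python
FOOTER_MARKER = "Could not read footer"  # text taken from the error in the post


def is_footer_read_error(exc: BaseException) -> bool:
    """Return True if exc, or any exception chained to it, mentions the footer failure."""
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if FOOTER_MARKER in str(current):
            return True
        # Follow explicit (raise ... from ...) and implicit exception chaining.
        current = current.__cause__ or current.__context__
    return False


def classify_failure(exc: BaseException) -> str:
    """Hypothetical policy: leave possibly-incomplete files alone, fail the rest."""
    return "retry_later" if is_footer_read_error(exc) else "move_to_failed"
```

In the job this would sit in the existing wrapper, roughly `except Exception as exc:` followed by calling dbutils.fs.mv only when `classify_failure(exc)` returns `"move_to_failed"`, so a file that is still uploading is left in place for the next run.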
Labels: Azure databricks, Delta Tables, File, Python
09-08-2022 02:55 AM
Do you have any idea how the file would be locked? Because that should not be the case (unless the file is actually still being written, so not finished yet).
09-08-2022 07:33 PM
That's the problem - it's not being locked (or fs.mv() isn't checking/honoring the lock). The upload process/tool is a 3rd-party external tool.
I can see via the upload tool that the file upload is 'in progress'.
I can also see the 0 byte destination file in the ADLS Gen2 container (while it's being uploaded).
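Since the in-progress file shows up as 0 bytes, one possible workaround (a heuristic sketch, not a real lock check) is to list the landing folder twice and only process files that are non-empty and whose size did not change in between. The function name `settled_files` and the `list_sizes` callable are hypothetical. On Databricks the callable could be something like `lambda: {f.path: f.size for f in dbutils.fs.ls(landing_dir)}`, where `landing_dir` is an assumed variable holding the upload folder.

```python
import time
from typing import Callable, Dict, List


def settled_files(
    list_sizes: Callable[[], Dict[str, int]],
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return paths that are non-empty and have the same size in two listings.

    list_sizes must return a {path: size_in_bytes} mapping each time it is called.
    """
    before = list_sizes()
    sleep(wait_seconds)  # give any in-flight upload time to grow or finish
    after = list_sizes()
    return sorted(
        path
        for path, size in after.items()
        # Skip 0-byte placeholders, files still growing, and files that
        # appeared between the two listings (no earlier size to compare).
        if size > 0 and before.get(path) == size
    )
```

A stable size does not prove the upload is finished (a stalled upload would look the same), so it is safest to pair this with leaving unreadable files in place instead of moving them. If the upload tool can be configured, having it write to a temporary name and rename on completion would be a more reliable signal.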

