Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

spark.read.parquet() - how to check for file lock before reading? (azure)

jakubk
Contributor

I have some Python code which takes parquet files from an ADLS Gen2 location and merges them into Delta tables (run as a workflow job on a schedule).

I have a try/except wrapper around this so that any files that fail get moved into a failed folder using dbutils.fs.mv, while the files that get processed are archived off to a different location.
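For context, the wrapper described above can be sketched generically. This is a minimal sketch, not the actual job code: `read_fn`, `archive_fn`, and `fail_fn` are placeholder callables standing in for `spark.read.parquet` and the two `dbutils.fs.mv` calls.

```python
def process_files(files, read_fn, archive_fn, fail_fn):
    """Attempt read_fn on each file; archive it on success,
    move it to the failed folder on any error."""
    outcomes = {}
    for path in files:
        try:
            read_fn(path)           # stand-in for spark.read.parquet(path)
        except Exception:
            fail_fn(path)           # stand-in for dbutils.fs.mv(path, failed_dir)
            outcomes[path] = "failed"
        else:
            archive_fn(path)        # stand-in for dbutils.fs.mv(path, archive_dir)
            outcomes[path] = "archived"
    return outcomes
```

The problem described below is exactly that the `except` branch fires for a file that is still mid-upload, so the move hits a half-written file.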

One scenario I've encountered is this:

- an external upload process is uploading somefile.parquet to ADLS Gen2

- the workflow job starts

- spark.read.parquet() fails with: Caused by: java.io.IOException: Could not read footer for file:

- dbutils.fs.mv moves the file (boo)

- the external process fails because mv has deleted the target while the upload is in progress

I'd assumed that mv would fail because there would be an exclusive lock on the file while it's being uploaded, but that's not the case (??)

Any suggestions on how to handle this?

Is there a way for me to check if a file is locked/being written to?

What's the error/exception to catch for this? I've spent hours trying to figure it out, but the generic Python exceptions don't cover it, and I get a NameError for the specific Spark ones I try.
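One workable pattern here (a sketch, since Spark wraps the underlying java.io.IOException and the exact Python exception class varies by runtime, often `py4j.protocol.Py4JJavaError`) is to catch broadly and classify by message text instead of by class:

```python
def is_incomplete_parquet_error(exc: Exception) -> bool:
    """Heuristic: Spark surfaces a truncated or half-uploaded parquet file
    as a wrapped java.io.IOException, so match on the message text rather
    than trying to import a specific exception class."""
    markers = ("Could not read footer", "is not a Parquet file")
    return any(m in str(exc) for m in markers)
```

In the workflow, you could catch a plain `Exception` around `spark.read.parquet(path)`, and when `is_incomplete_parquet_error(e)` is true, leave the file in place for the next scheduled run instead of calling `dbutils.fs.mv`; re-raise everything else.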

2 REPLIES

-werners-
Esteemed Contributor III

Do you have any idea how the file would be locked? Because that should not be the case (unless the file is actually still being written, i.e. not finished yet).

jakubk
Contributor

That's the problem - it's not being locked (or fs.mv() isn't checking/honoring any lock). The upload process/tool is a 3rd-party external tool.

I can see via the upload tool that the file upload is 'in progress'

I can also see the 0-byte destination file in the ADLS Gen2 container (while it's being uploaded)
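Since there is no lock to check, one common workaround (a sketch, built on the 0-byte observation above) is to treat a file as ready only when its size is non-zero and unchanged between two snapshots. `size_fn` is a placeholder; on Databricks it could be wired to `dbutils.fs.ls`:

```python
import time

def looks_finished(size_fn, path, settle_seconds=10.0):
    """Return True when the file's size is non-zero and has not changed
    across two snapshots taken settle_seconds apart. A still-uploading
    file is either 0 bytes or growing, so it fails this check."""
    first = size_fn(path)    # e.g. dbutils.fs.ls(path)[0].size on Databricks
    time.sleep(settle_seconds)
    second = size_fn(path)
    return first > 0 and first == second
```

A more robust fix, if the uploader can be changed, is to upload under a temporary name and rename to `*.parquet` only once the upload completes; a rename within ADLS Gen2 (hierarchical namespace) is atomic, so the job would never see a partial file.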
