How does FSCK work and does it have any negative effects on subsequent notebook executions?

CalvinCalvert_
New Contributor

In my environment, there are three groups of notebooks that run on their own schedules; however, they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, various notebooks from each of the three groups fail with the following error:

"Error in SQL statement: SparkException: Job aborted due to stage failure: ... Error while reading file dbfs:/mnt/<path>/<...>.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table

DELETE
statement. ..."

Previously, we've restored the file from its deleted state in S3 and rerun the notebooks successfully. The documentation for this error points to FSCK as a possible solution, but I have the following questions:
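For context, the FSCK command in question supports a dry-run mode, so its effect on the transaction log can be previewed before any entries are removed. A minimal sketch (the table name is a placeholder, not from this post):

```sql
-- Preview which missing-file entries FSCK would remove from the Delta log
-- (my_audit_table is a placeholder for the actual table name)
FSCK REPAIR TABLE my_audit_table DRY RUN;

-- If the list looks correct, actually remove the dangling entries
FSCK REPAIR TABLE my_audit_table;
```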

1) Is Databricks marking the S3 parquet files as deleted as part of its normal work? If so, was restoring those deleted files particularly wrong or bad to do?

2) Does running FSCK to remove the transaction log file entries that can't be found anymore set us up for unintended consequences such as missing or incomplete transaction log data over time?
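One way to investigate question 1 is to check whether the files were removed by Databricks' own maintenance (e.g. a VACUUM run) rather than a manual delete, by inspecting the table history and retention settings. A hedged sketch, again with a placeholder table name:

```sql
-- Look for VACUUM, DELETE, or OVERWRITE operations around the failure time
-- (my_audit_table is a placeholder for the actual table name)
DESCRIBE HISTORY my_audit_table;

-- Check how long removed data files are retained before VACUUM deletes them
SHOW TBLPROPERTIES my_audit_table ('delta.deletedFileRetentionDuration');
```

If a VACUUM appears in the history shortly before the failures, restoring the deleted files only papers over the mismatch between the log and S3, which may be relevant to question 2.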

Thanks in advance!
