How does FSCK work and does it have any negative effects on subsequent notebook executions?

CalvinCalvert_
New Contributor

In my environment, there are 3 groups of notebooks that run on their own schedules; however, they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, various notebooks from each of the 3 groups fail with the following error:

"Error in SQL statement: SparkException: Job aborted due to stage failure: ... Error while reading file dbfs:/mnt/<path>/<...>.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table

DELETE
statement. ..."

Previously, we've restored the files from their deleted state in S3 and rerun the notebooks successfully. The documentation for this error pointed us to FSCK as a possible solution, but I have the following questions:
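
For reference, this is roughly what we were thinking of running based on the docs (the table name below is just a placeholder for our audit-log table):

-- Preview which missing files FSCK would remove from the transaction log
FSCK REPAIR TABLE auditlogs DRY RUN;

-- If the preview looks right, remove the dangling file entries for real
FSCK REPAIR TABLE auditlogs;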

1) Is Databricks marking the S3 parquet files as deleted as part of its normal operation (see the history check sketched after these questions)? If so, was it wrong or harmful to restore those deleted files?

2) Does running FSCK to remove the transaction-log entries for files that can no longer be found set us up for unintended consequences, such as missing or incomplete transaction log data over time?
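
To try to answer question 1 ourselves, we were planning to check the table history for VACUUM (or other delete) operations around the failure times; a rough sketch, again with a placeholder table name:

-- List recent operations on the table; VACUUM START / VACUUM END entries would
-- suggest Databricks removed the files as part of normal retention cleanup,
-- rather than something deleting them outside of Delta
DESCRIBE HISTORY auditlogs;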

Thanks in advance!

