In my environment, there are three groups of notebooks that run on their own schedules; however, they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, notebooks from each of the three groups fail with the following error:
"Error in SQL statement: SparkException: Job aborted due to stage failure: ... Error while reading file dbfs:/mnt/<path>/<...>.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table
DELETE
statement. ..."
Previously, we've been restoring the files from their deleted state in S3 and rerunning the notebooks, which has worked. Studying the documentation for the error pointed us to FSCK as a possible solution (I've sketched both the restore workaround and the FSCK command below), but I have the following questions:
1) Is Databricks marking the S3 Parquet files as deleted as part of its normal operation? If so, was restoring those deleted files a particularly wrong or harmful thing to do?
2) Does running FSCK to remove the transaction-log entries for files that can no longer be found set us up for unintended consequences, such as missing or incomplete transaction-log data over time?
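For context, here is roughly what our manual restore has looked like. This is only a sketch, assuming S3 versioning is enabled on the bucket and the "deleted" file is simply hidden behind a delete marker; the bucket name and object key below are placeholders, not our real paths.

```python
import boto3

# Placeholders -- not our real bucket/key.
BUCKET = "my-auditlog-bucket"
KEY = "mnt-path/part-00000.snappy.parquet"

s3 = boto3.client("s3")

# List the object's versions; a "delete" on a versioned bucket just adds a delete marker.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)

for marker in versions.get("DeleteMarkers", []):
    if marker["Key"] == KEY and marker["IsLatest"]:
        # Deleting the delete marker itself makes the previous version visible again,
        # i.e. the file is "restored from the deleted state".
        s3.delete_object(Bucket=BUCKET, Key=marker["Key"], VersionId=marker["VersionId"])
```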
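And this is the FSCK route we're considering, run from a Databricks notebook where spark is the preconfigured session. The table name auditlogs is a placeholder for our Delta table. As I understand the docs, DRY RUN only lists the files that would be dropped from the Delta transaction log, while the plain command actually removes those entries.

```python
# DRY RUN: report which files referenced in the Delta log can no longer be found,
# without changing anything. "auditlogs" is a placeholder table name.
missing = spark.sql("FSCK REPAIR TABLE auditlogs DRY RUN")
missing.show(truncate=False)

# The real repair, which drops the missing-file entries from the transaction log.
# Commented out until we understand the long-term consequences (question 2 above).
# spark.sql("FSCK REPAIR TABLE auditlogs")
```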
Thanks in advance!