In my environment, there are three groups of notebooks that run on their own schedules; however, they all use the same underlying transaction logs (auditlogs, as we call them) in S3. From time to time, notebooks from each of the three groups fail with the following error:
"Error in SQL statement: SparkException: Job aborted due to stage failure: ... Error while reading file dbfs:/mnt/<path>/<...>.snappy.parquet. A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table
DELETE
statement. ..."
Previously, we've been restoring the files from their deleted state in S3 and rerunning the notebooks, which has worked. Studying the documentation for the error pointed us to FSCK as a possible solution (I've sketched both the restore workaround and the FSCK command below), but I have the following questions:
1) Is Databricks marking the S3 Parquet files as deleted as part of its normal operation? If so, was restoring those deleted files a particularly wrong or harmful thing to do?
2) Does running FSCK to remove the transaction-log entries for files that can no longer be found set us up for unintended consequences, such as missing or incomplete transaction-log data over time?
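For context, here is roughly what our manual restore has looked like. This is only a sketch, assuming S3 versioning is enabled on the bucket and the "deleted" file is simply hidden behind a delete marker; the bucket name and object key below are placeholders, not our real paths.

```python
import boto3

# Placeholders -- not our real bucket/key.
BUCKET = "my-auditlog-bucket"
KEY = "mnt-path/part-00000.snappy.parquet"

s3 = boto3.client("s3")

# List the object's versions; a "delete" on a versioned bucket just adds a delete marker.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)

for marker in versions.get("DeleteMarkers", []):
    if marker["Key"] == KEY and marker["IsLatest"]:
        # Deleting the delete marker itself makes the previous version visible again,
        # i.e. the file is "restored from the deleted state".
        s3.delete_object(Bucket=BUCKET, Key=marker["Key"], VersionId=marker["VersionId"])
```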
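And this is the FSCK route we're considering, run from a Databricks notebook where spark is the preconfigured session. The table name auditlogs is a placeholder for our Delta table. As I understand the docs, DRY RUN only lists the files that would be dropped from the Delta transaction log, while the plain command actually removes those entries.

```python
# DRY RUN: report which files referenced in the Delta log can no longer be found,
# without changing anything. "auditlogs" is a placeholder table name.
missing = spark.sql("FSCK REPAIR TABLE auditlogs DRY RUN")
missing.show(truncate=False)

# The real repair, which drops the missing-file entries from the transaction log.
# Commented out until we understand the long-term consequences (question 2 above).
# spark.sql("FSCK REPAIR TABLE auditlogs")
```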
Thanks in advance!