Hello,
We are experiencing an error with one of our Structured Streaming jobs: the checkpoint gets corrupted and we are unable to continue the execution.
I've checked the errors, and this happens when an auto-compaction is triggered: something fails, the _spark_metadata folder in the path I'm writing to is left with no files, and the checkpoint gets corrupted.
The job runs once a day and stores its checkpoints in an S3 bucket. I managed to fix this once by adding a ***** file in _spark_metadata associated with the checkpoint commit that failed, but after 3 days I'm hitting the same error again.
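For reference, here is a minimal sketch of this kind of setup; the source format, schema, and paths below are placeholders, not our actual job:

```python
# Minimal sketch of a once-daily file-sink streaming job (all names are
# placeholders; the real job's source, schema, and paths differ).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-ingest").getOrCreate()

stream = (
    spark.readStream
    .format("parquet")                  # assumed source format
    .schema("id LONG, payload STRING")  # placeholder schema
    .load("s3://bucketname/landing/")   # placeholder input path
)

query = (
    stream.writeStream
    .format("parquet")  # file sink: this is what maintains _spark_metadata
    .option("path", "s3://bucketname/processname/2022/12/1/5/")
    .option("checkpointLocation", "s3://bucketname/checkpoints/processname/")
    .trigger(once=True)  # the job is invoked once per day
    .start()
)
query.awaitTermination()
```

These are the relevant log lines from the failing run: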
```
INFO FileStreamSinkLog: Set the compact interval to 10 [defaultCompactInterval: 10]
INFO S3AFileSystem:V3: FS_OP_CREATE BUCKET[bucketname] FILE[s3://bucketname/processname/2022/12/1/5/_spark_metadata/19.compact] Creating output stream; permission: { masked: rw-r--r--, unmasked: rw-rw-rw- }, overwrite: false, bufferSize: 65536
INFO S3AFileSystem:V3: spark.databricks.io.parquet.verifyChecksumOnWrite.enabled is disabled
INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[bucketname] FILE[bucketname/processname/2022/12/1/5/_spark_metadata/19.compact] Cancelling stream
22/12/01 05:12:34 INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[bucketname] FILE[processname/2022/12/1/5/_spark_metadata/19.compact] Successfully cancelled stream
INFO AWSCheckpointFileManager: Cancelled writing to path
ERROR FileFormatWriter: Aborting job 58d6f7d7-8ef4-4c8c-9546-1ccf5fc477d9.
java.io.FileNotFoundException: Unable to find batch s3://bucketname/processname/2022/12/1/5/_spark_metadata/9.compact
```
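If I understand the file sink log correctly, every 10th batch (the compact interval above) is written as a .compact file that merges the previous compact file with the delta files written since, so building 19.compact has to read 9.compact plus batches 10 through 18; since 9.compact is gone from _spark_metadata, the job aborts with the FileNotFoundException above. These seem to be the settings that govern that behaviour; they are internal Spark options, so the names and defaults below are worth double-checking against your Spark version:

```python
# Sketch of the internal Spark options controlling the file-sink metadata
# log; the values shown are the defaults as I understand them.
spark.conf.set("spark.sql.streaming.fileSink.log.compactInterval", "10")
spark.conf.set("spark.sql.streaming.fileSink.log.deletion", "true")        # clean up superseded log files
spark.conf.set("spark.sql.streaming.fileSink.log.cleanupDelay", "600000")  # delay (ms) before cleanup
```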
Things I'm unsure about:
- I've found a couple of posts from 2017 advising against S3 for checkpoints because of S3's eventual consistency, but also newer ones saying this has been fixed (S3 has provided strong read-after-write consistency since December 2020).
- I've also found posts saying that Databricks handles this behind the scenes, but I couldn't find anything about it in the Databricks documentation.
- We also have other streaming jobs that run daily and have never hit this issue. The only difference with this one is that its first step replicates files from the source into our landing S3 bucket, and that is where it fails.
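Since my manual fix was to put a file back under _spark_metadata, I've been auditing that folder after each run so a missing N.compact is caught before the next day's execution trips over it. A minimal sketch of that check (assumes boto3 with credentials configured; bucket and prefix are taken from the log lines above):

```python
# Sketch: list _spark_metadata so missing batch/compact files are visible.
# Bucket and prefix mirror the log lines above; adjust for the real layout.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
prefix = "processname/2022/12/1/5/_spark_metadata/"
for page in paginator.paginate(Bucket="bucketname", Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```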
Related posts: