
Structured Streaming Checkpoint corrupted.

mriccardi
New Contributor II

Hello,

We are experiencing an error with one of our Structured Streaming jobs: the checkpoint gets corrupted and we are unable to continue with the execution.

I've checked the errors and this happens when an auto-compaction is triggered; something then fails, the _spark_metadata folder in the path I'm writing to ends up with no files, and the checkpoint gets corrupted.

This job runs once daily and stores its checkpoints in an S3 bucket. I managed to fix this once by adding a ***** file in _spark_metadata associated with the checkpoint commit that failed, but after 3 days I'm hitting the same error again.
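For reference, the job follows roughly the pattern below: a once-daily run that writes a file sink to S3 and keeps its checkpoint on S3 as well. This is only a minimal sketch with placeholder paths, schema and formats, not the actual code:

```python
# Minimal sketch of the setup described above: a once-daily streaming job with
# a file sink and an S3 checkpoint. Paths, schema and formats are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
])

df = (
    spark.readStream
    .schema(schema)
    .format("parquet")
    .load("s3://bucketname/landing/processname/")
)

query = (
    df.writeStream
    .format("parquet")
    .option("checkpointLocation", "s3://bucketname/checkpoints/processname/")
    .trigger(once=True)  # the job is scheduled once a day
    .start("s3://bucketname/processname/")  # sink path containing _spark_metadata
)
query.awaitTermination()
```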

INFO FileStreamSinkLog: Set the compact interval to 10 [defaultCompactInterval: 10]
INFO S3AFileSystem:V3: FS_OP_CREATE BUCKET[bucketname] FILE[s3://bucketname/processname/2022/12/1/5/_spark_metadata/19.compact] Creating output stream; permission: { masked: rw-r--r--, unmasked: rw-rw-rw- }, overwrite: false, bufferSize: 65536
INFO S3AFileSystem:V3: spark.databricks.io.parquet.verifyChecksumOnWrite.enabled is disabled
INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[bucketname] FILE[bucketname/processname/2022/12/1/5/_spark_metadata/19.compact] Cancelling stream
22/12/01 05:12:34 INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[bucketname] FILE[processname/2022/12/1/5/_spark_metadata/19.compact] Successfully cancelled stream
 INFO AWSCheckpointFileManager: Cancelled writing to path
ERROR FileFormatWriter: Aborting job 58d6f7d7-8ef4-4c8c-9546-1ccf5fc477d9.
java.io.FileNotFoundException: Unable to find batch s3://bucketname/processname/2022/12/1/5/_spark_metadata/9.compact
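If I read the trace correctly, the compact interval of 10 means every tenth batch rewrites the sink log into an N.compact file, and building 19.compact requires the previous 9.compact, which is exactly the file the job can no longer find. A quick way to see what is actually left on S3 is to list the sink's _spark_metadata folder alongside the checkpoint's commits folder. A rough sketch using the placeholder paths from the log (dbutils is assumed since this runs on Databricks):

```python
# Rough sketch: compare what exists in the file sink's _spark_metadata folder
# with the batches the checkpoint believes were committed. Paths below mirror
# the placeholder paths from the log; adjust to the real locations.
sink_metadata = "s3://bucketname/processname/2022/12/1/5/_spark_metadata/"
checkpoint_commits = "s3://bucketname/checkpoints/processname/commits/"

print("Sink metadata entries:")
for entry in dbutils.fs.ls(sink_metadata):
    print("  ", entry.name)

print("Checkpoint commit entries:")
for entry in dbutils.fs.ls(checkpoint_commits):
    print("  ", entry.name)
```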

Things that I don't know:

  • I've found a couple of posts from 2017 recommending against using S3 for checkpoints because of S3's "eventual consistency", but also newer ones saying this has since been fixed.
  • I've also found some posts saying that Databricks handles this behind the scenes, but I couldn't find anything about it in the Databricks documentation.
  • We also have other streaming jobs that run daily and have never hit this issue. The only difference with this one is that its first step replicates the files from the source into our landing S3 bucket, and that's where it fails.


1 REPLY

jose_gonzalez
Moderator

Hi @Martin Riccardi,

Could you please share the following:

1) What's your source?

2) What's your sink?

3) Could you share your readStream() and writeStream() code?

4) The full error stack trace.

5) Did you stop and re-run your query after weeks of it not being active?

6) Did you change anything in your checkpoint folder?
