Structured Streaming Checkpoint corrupted.

mriccardi
New Contributor II

Hello,

We are experiencing an error with one of our Structured Streaming jobs: the checkpoint gets corrupted and we are unable to continue the execution.

I've checked the errors, and this happens when an auto-compaction is triggered: something fails, the _spark_metadata folder in the path I'm writing to ends up with no files, and the checkpoint gets corrupted.

This job runs once daily and stores its checkpoints in an S3 bucket. I managed to fix this once by adding a ***** file in _spark_metadata associated with the checkpoint commit that failed, but after 3 days I'm hitting the same error again.

INFO FileStreamSinkLog: Set the compact interval to 10 [defaultCompactInterval: 10]
INFO S3AFileSystem:V3: FS_OP_CREATE BUCKET[bucketname] FILE[s3://bucketname/processname/2022/12/1/5/_spark_metadata/19.compact] Creating output stream; permission: { masked: rw-r--r--, unmasked: rw-rw-rw- }, overwrite: false, bufferSize: 65536
INFO S3AFileSystem:V3: spark.databricks.io.parquet.verifyChecksumOnWrite.enabled is disabled
INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[bucketname] FILE[bucketname/processname/2022/12/1/5/_spark_metadata/19.compact] Cancelling stream
22/12/01 05:12:34 INFO S3ABlockOutputStream:V3: FS_OP_CREATE BUCKET[bucketname] FILE[processname/2022/12/1/5/_spark_metadata/19.compact] Successfully cancelled stream
INFO AWSCheckpointFileManager: Cancelled writing to path
ERROR FileFormatWriter: Aborting job 58d6f7d7-8ef4-4c8c-9546-1ccf5fc477d9.
java.io.FileNotFoundException: Unable to find batch s3://bucketname/processname/2022/12/1/5/_spark_metadata/9.compact
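
For reference, the shape of the job is roughly the sketch below (bucket names, paths, schema, and trigger are illustrative placeholders, not our exact code). It's a plain parquet file sink, which is what maintains the _spark_metadata log being compacted in the errors above; as far as I can tell, the compact interval of 10 in the log is just the default of spark.sql.streaming.fileSink.log.compactInterval.

    # Rough sketch of the job (illustrative placeholders, not our exact code).
    # A file sink like this maintains the _spark_metadata log whose compaction
    # is failing in the errors above. `spark` is the ambient SparkSession on
    # Databricks.
    from pyspark.sql.types import StructType, StructField, StringType

    input_schema = StructType([StructField("value", StringType())])  # placeholder

    stream = (
        spark.readStream
        .format("parquet")                 # source format is an assumption
        .schema(input_schema)
        .load("s3://bucketname/landing/")  # illustrative path
    )

    query = (
        stream.writeStream
        .format("parquet")                 # file sink -> writes _spark_metadata
        .option("path", "s3://bucketname/processname/2022/12/1/5/")
        .option("checkpointLocation", "s3://bucketname/checkpoints/processname/")
        .trigger(once=True)                # we run it once a day
        .start()
    )
    query.awaitTermination()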

Things I don't know:

  • I've found a couple of posts from 2017 recommending against using S3 for checkpoints because of its "eventual consistency", but also newer ones saying this has been fixed (S3 has offered strong read-after-write consistency since December 2020).
  • I've also found posts saying that Databricks handles this behind the scenes, but I couldn't find anything about it in the Databricks documentation.
  • We also have other streaming jobs that run daily and have never hit this issue. The only difference with this one is that its first step replicates files from the source into our landing S3 bucket, and that's where it fails.
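
When this happens, the check I run before applying the manual fix described above looks roughly like this (Databricks notebook snippet; buckets and paths are illustrative):

    # Quick notebook check to spot the batch whose sink-log entry is missing
    # (illustrative paths; dbutils is only available on Databricks).
    checkpoint = "s3://bucketname/checkpoints/processname"
    sink_meta = "s3://bucketname/processname/2022/12/1/5/_spark_metadata"

    committed = sorted(int(f.name) for f in dbutils.fs.ls(checkpoint + "/commits")
                       if f.name.isdigit())
    sink_entries = {f.name for f in dbutils.fs.ls(sink_meta)}

    missing = [b for b in committed
               if str(b) not in sink_entries
               and str(b) + ".compact" not in sink_entries]
    print("committed batches:", committed)
    print("sink log entries:", sorted(sink_entries))
    print("committed batches with no sink log entry:", missing)

Whichever committed batch has no matching entry in the sink log is the one whose .compact file the job then fails to find.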

1 REPLY

jose_gonzalez
Databricks Employee

Hi @Martin Riccardi,

Could you please share the following:

1) What's your source?

2) What's your sink?

3) Could you share your readStream() and writeStream() code?

4) The full error stack trace.

5) Did you stop and re-run your query after weeks of inactivity?

6) Did you change anything in your checkpoint folder?
