Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing

susmitsircar
New Contributor III

We are encountering an issue in our Spark streaming pipeline when attempting to write checkpoint data to S3. The error we are seeing is as follows:

25/08/12 13:35:40 ERROR RocksDBFileManager : Error zipping to s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip
java.nio.file.FileAlreadyExistsException: s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip

It seems that the file already exists at the destination path in S3 (0.zip), causing the job to fail. This issue is hindering our checkpointing process during streaming.
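For context, our setup is roughly of the following shape (the source and sink below are simplified for illustration; only the checkpoint path is the real one from the error). The failing 0.zip is written by Spark under the query's checkpointLocation, in the per-source metadata directory sources/0/rocksdb:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Simplified source; the real job reads a different source/format.
events = spark.readStream.format("delta").load("s3://xxx-datalake-binary/event-types/input")

(events.writeStream
    .format("delta")
    # Checkpoint root from the error; Spark keeps sources/0/rocksdb/0.zip under it.
    .option("checkpointLocation",
            "s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced")
    .start("s3://xxx-datalake-binary/event-types/output"))  # simplified sink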

Could you please advise on how to best address this issue or suggest any best practices for handling such file conflicts in streaming jobs? Additionally, if there are any configuration adjustments or features that could help us mitigate this error, that would be greatly appreciated.

Looking forward to your guidance.

3 REPLIES

lingareddy_Alva
Honored Contributor III

Hi @susmitsircar 

Best practices / Fixes
1. Clean up the checkpoint directory before restart
If you know the stream can safely start from scratch or reprocess data:
Delete the S3 checkpoint path before restarting.
This ensures no stale 0.zip files remain.
dbutils.fs.rm(checkpoint_path, True)

2. Use a unique checkpoint location per stream
Ensure each distinct streaming query (even if reading from the same source) has a unique checkpoint path.
This avoids conflicts from multiple jobs writing to the same RocksDB state store location.

import uuid

checkpoint_path = f"s3://xxx-datalake-binary/event-types/checkpoint/{uuid.uuid4()}"

3. Enable state store cleanup
You can configure Structured Streaming to clean up old state files, reducing leftover file conflicts:
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")

Recommended approach for production
- Always use unique checkpoint locations for each streaming application.
- Clean up checkpoints from failed runs before restarting; a combined sketch follows below.
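Putting both of these together, here is a minimal sketch; the query name below is just an example, and dbutils is assumed to be available (Databricks notebook). Delete the checkpoint only when the query can safely reprocess from scratch:

# One stable checkpoint root per streaming query (example name).
query_name = "eventlog_batch_pool_generator_input_pool_reduced"
checkpoint_path = f"s3://xxx-datalake-binary/event-types/checkpoint/{query_name}"

def reset_checkpoint_if_safe(path, allow_reprocessing):
    # Dropping the checkpoint removes offsets and RocksDB state,
    # so the next run starts from scratch.
    if allow_reprocessing:
        dbutils.fs.rm(path, True)

# Only before an intentional fresh start:
# reset_checkpoint_if_safe(checkpoint_path, allow_reprocessing=True)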

 

 

LR

Thanks for your reply. I can't find this parameter in the public docs; can you point me to where it is documented?

spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")

Sorry, I stated that wrongly. Check whether the one below works.

spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true"
)

https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/rocksdb-state-store?utm_sour...
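For completeness, that setting only takes effect when the query uses the RocksDB state store provider; a minimal sketch, assuming the provider class name from the docs linked above (open-source Spark uses org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider instead):

# Set before the streaming query starts; the changelog-checkpointing flag
# above then applies to this RocksDB-backed state store.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)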

LR
