Spark streaming failing intermittently with FileAlreadyExistsException during RocksDB checkpointing
08-12-2025 06:55 AM
We are encountering an issue in our Spark streaming pipeline when attempting to write checkpoint data to S3. The error we are seeing is as follows:
25/08/12 13:35:40 ERROR RocksDBFileManager : Error zipping to s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip
java.nio.file.FileAlreadyExistsException: s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip
It seems that the file already exists at the destination path in S3 (0.zip), causing the job to fail. This issue is hindering our checkpointing process during streaming.
Could you please advise on how to best address this issue or suggest any best practices for handling such file conflicts in streaming jobs? Additionally, if there are any configuration adjustments or features that could help us mitigate this error, that would be greatly appreciated.
Looking forward to your guidance.
08-12-2025 09:00 AM
Best practices / Fixes
1. Clean up the checkpoint directory before restart
If you know the stream can safely start from scratch or reprocess data:
Delete the S3 checkpoint path before restarting.
This ensures no stale 0.zip files remain.
dbutils.fs.rm(checkpoint_path, True)
2. Use a unique checkpoint location per stream
Ensure each distinct streaming query (even if reading from the same source) has a unique checkpoint path.
This avoids conflicts from multiple jobs writing to the same RocksDB state store location.
import uuid
checkpoint_path = f"s3://xxx-datalake-binary/event-types/checkpoint/{uuid.uuid4()}"  # generate once and reuse on restart
3. Enable state store cleanup
You can configure Structured Streaming to clean up old state files, reducing leftover file conflicts:
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")
Recommended approach for production
Always use unique checkpoint locations for each streaming application
Clean up checkpoints from failed runs before restarting
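As a rough illustration of the unique-location advice above, the sketch below derives a stable checkpoint path from a query name instead of a random value, so the same query reuses its checkpoint across restarts while different queries never share one. The helper name `checkpoint_path_for`, the base prefix, and the sanitizing rule are assumptions made for this example, not part of any Spark or Databricks API.

```python
import re

# Hypothetical base prefix; substitute your own bucket and folder.
CHECKPOINT_BASE = "s3://xxx-datalake-binary/event-types/checkpoint"


def checkpoint_path_for(query_name: str, base: str = CHECKPOINT_BASE) -> str:
    """Return a deterministic checkpoint path for one streaming query."""
    # Collapse anything outside a conservative character set into underscores.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", query_name).strip("_")
    if not safe:
        raise ValueError("query_name must contain at least one usable character")
    return f"{base.rstrip('/')}/{safe}"


if __name__ == "__main__":
    print(checkpoint_path_for("pool generator/input"))
```

The returned string would then be passed as the `checkpointLocation` option of the streaming write, one distinct name per query.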
08-12-2025 09:49 AM
Thanks for your reply. I couldn't find this parameter in the public docs. Can you point me to where it is documented?
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")
08-12-2025 10:33 AM
Sorry, I stated that incorrectly. Please check whether the setting below works.
spark.conf.set(
"spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
"true"
)
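A minimal sketch of how the setting above might be applied and then read back so the effective value can be logged before the query starts. It assumes `spark` is an active session on a Spark or Databricks runtime version that supports this key; the helper name `enable_changelog_checkpointing` is made up for illustration. Whether this resolves the FileAlreadyExistsException reported in this thread is not confirmed.

```python
CHANGELOG_KEY = "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled"


def enable_changelog_checkpointing(spark) -> str:
    """Set the flag on the session and return the value the session reports back."""
    spark.conf.set(CHANGELOG_KEY, "true")
    # Read the value back so it can be logged alongside the query start.
    return spark.conf.get(CHANGELOG_KEY)
```

In a notebook this would typically be called once, before the streaming query is started, since a running query does not pick up configuration changes made after it starts.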