Re: Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing

susmitsircar · 08-12-2025

We are encountering an issue in our Spark streaming pipeline when attempting to write checkpoint data to S3. The error we are seeing is as follows:

25/08/12 13:35:40 ERROR RocksDBFileManager : Error zipping to s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip
java.nio.file.FileAlreadyExistsException: s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip

It seems that the file already exists at the destination path in S3 (0.zip), causing the job to fail. This issue is hindering our checkpointing process during streaming.

Could you please advise on how to best address this issue or suggest any best practices for handling such file conflicts in streaming jobs? Additionally, if there are any configuration adjustments or features that could help us mitigate this error, that would be greatly appreciated.

Looking forward to your guidance.

lingareddy_Alva · 08-12-2025

Hi @susmitsircar

Best practices / Fixes
1. Clean up the checkpoint directory before restart
If you know the stream can safely start from scratch or reprocess data:
Delete the S3 checkpoint path before restarting.
This ensures no stale 0.zip files remain.
dbutils.fs.rm(checkpoint_path, True)

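2. Use a unique checkpoint location per stream
Ensure each distinct streaming query (even if reading from the same source) has its own checkpoint path, and keep that path the same across restarts of that query.
This avoids conflicts from multiple jobs writing to the same RocksDB state store location.

import uuid
# Generate this once when the stream is first created and store the value;
# a new UUID on every start would abandon the previous checkpoint.
checkpoint_path = f"s3://xxx-datalake-binary/event-types/checkpoint/{uuid.uuid4()}"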

3. Enable state store cleanup
You can configure Structured Streaming to clean up old state files, reducing leftover file conflicts:
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")
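For option 2, the checkpoint path has to be unique per query but also identical across restarts of the same query, otherwise a restarted query cannot find its offsets and state. A minimal sketch, assuming every query has a fixed, human-chosen name; `checkpoint_path_for`, `assert_unique`, and `CHECKPOINT_ROOT` are hypothetical helpers for illustration, not part of any Spark or Databricks API:

```python
import re
from typing import List

# Base prefix taken from the path in the question; adjust to your bucket.
CHECKPOINT_ROOT = "s3://xxx-datalake-binary/event-types/checkpoint"


def checkpoint_path_for(query_name: str, root: str = CHECKPOINT_ROOT) -> str:
    """Build a checkpoint path that is unique per query and stable across restarts."""
    # Replace anything that is not a letter, digit, dot, underscore, or hyphen.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", query_name).strip("_")
    if not slug:
        raise ValueError("query_name must contain at least one usable character")
    return f"{root.rstrip('/')}/{slug}"


def assert_unique(paths: List[str]) -> None:
    """Fail fast if two queries were given the same checkpoint location."""
    duplicates = sorted({p for p in paths if paths.count(p) > 1})
    if duplicates:
        raise ValueError(f"checkpoint locations shared by several queries: {duplicates}")
```

The returned string would then be passed as `.option("checkpointLocation", ...)` on the query's `writeStream`. Deriving the path from a name instead of a random value keeps it reproducible, so a restart picks up the same checkpoint.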

Recommended approach for production
Always use unique checkpoint locations for each streaming application
Clean up checkpoints from failed runs before restarting
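Deleting a checkpoint discards the stream's offsets and state, so the cleanup step is worth guarding. A minimal sketch, assuming a Databricks notebook where `dbutils` is available; `reset_checkpoint` and `ALLOWED_ROOT` are hypothetical names, and the file-system object is passed in as a parameter so the guard can be exercised outside Databricks:

```python
# Prefix under which checkpoints are allowed to be deleted (taken from the
# path in the question; adjust to your own layout).
ALLOWED_ROOT = "s3://xxx-datalake-binary/event-types/checkpoint/"


def reset_checkpoint(fs, checkpoint_path: str, allowed_root: str = ALLOWED_ROOT) -> bool:
    """Recursively delete one checkpoint directory under allowed_root.

    `fs` is expected to behave like `dbutils.fs`, i.e. expose rm(path, recurse).
    """
    root = allowed_root.rstrip("/") + "/"
    path = checkpoint_path.rstrip("/")
    # Refuse the root itself and anything outside it, so a typo cannot wipe
    # every stream's checkpoint or unrelated data.
    if not path.startswith(root):
        raise ValueError(
            f"refusing to delete {checkpoint_path!r}: not a subdirectory of {root!r}"
        )
    return fs.rm(path, True)  # second argument: recurse
```

In a notebook this would be called as `reset_checkpoint(dbutils.fs, checkpoint_path)`, and only while the query is stopped and reprocessing from scratch is acceptable.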

LR

susmitsircar · 08-12-2025

Thanks for your reply. I couldn't find this parameter in the public docs. Can you point me to where it is documented?

spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")

lingareddy_Alva · 08-12-2025

Sorry, I misstated that. Please check whether the setting below works.

spark.conf.set(
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled",
    "true"
)

https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/rocksdb-state-store?utm_sour...

LR
