<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128221#M48184</link>
    <description>&lt;P&gt;We are encountering an issue in our Spark streaming pipeline when attempting to write checkpoint data to S3. The error we are seeing is as follows:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;25/08/12 13:35:40 ERROR RocksDBFileManager : Error zipping to s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip
java.nio.file.FileAlreadyExistsException: s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip&lt;/LI-CODE&gt;&lt;P&gt;It seems that the file already exists at the destination path in S3 (0.zip), causing the job to fail. This issue is hindering our checkpointing process during streaming.&lt;/P&gt;&lt;P&gt;Could you please advise on how to best address this issue or suggest any best practices for handling such file conflicts in streaming jobs? Additionally, if there are any configuration adjustments or features that could help us mitigate this error, that would be greatly appreciated.&lt;/P&gt;&lt;P&gt;Looking forward to your guidance.&lt;/P&gt;</description>
    <pubDate>Tue, 12 Aug 2025 13:55:29 GMT</pubDate>
    <dc:creator>susmitsircar</dc:creator>
    <dc:date>2025-08-12T13:55:29Z</dc:date>
    <item>
      <title>Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128221#M48184</link>
      <description>&lt;P&gt;We are encountering an issue in our Spark streaming pipeline when attempting to write checkpoint data to S3. The error we are seeing is as follows:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;25/08/12 13:35:40 ERROR RocksDBFileManager : Error zipping to s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip
java.nio.file.FileAlreadyExistsException: s3://xxx-datalake-binary/event-types/checkpoint/eventlog.e682ad_batch_pool_generator_input_pool_reduced/sources/0/rocksdb/0.zip&lt;/LI-CODE&gt;&lt;P&gt;It seems that the file already exists at the destination path in S3 (0.zip), causing the job to fail. This issue is hindering our checkpointing process during streaming.&lt;/P&gt;&lt;P&gt;Could you please advise on how to best address this issue or suggest any best practices for handling such file conflicts in streaming jobs? Additionally, if there are any configuration adjustments or features that could help us mitigate this error, that would be greatly appreciated.&lt;/P&gt;&lt;P&gt;Looking forward to your guidance.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Aug 2025 13:55:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128221#M48184</guid>
      <dc:creator>susmitsircar</dc:creator>
      <dc:date>2025-08-12T13:55:29Z</dc:date>
    </item>
    <item>
      <title>Re: Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128248#M48187</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176996"&gt;@susmitsircar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Best practices / Fixes&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;1. Clean up the checkpoint directory before restart&lt;/STRONG&gt;&lt;BR /&gt;If you know the stream can safely start from scratch or reprocess data, delete the S3 checkpoint path before restarting. This ensures no stale 0.zip files remain. Note that deleting the checkpoint also discards the stream's saved offsets and state.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;dbutils.fs.rm(checkpoint_path, True)&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;2. Use a unique checkpoint location per stream&lt;/STRONG&gt;&lt;BR /&gt;Ensure each distinct streaming query (even if reading from the same source) has a unique checkpoint path.&lt;BR /&gt;This avoids conflicts from multiple jobs writing to the same RocksDB state store location.&lt;BR /&gt;Generate the identifier once and keep it fixed across restarts of the same query; a new random path on every start would make the stream lose its progress.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import uuid
checkpoint_path = f"s3://xxx-datalake-binary/event-types/checkpoint/{uuid.uuid4()}"&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;3. Enable state store cleanup&lt;/STRONG&gt;&lt;BR /&gt;You can configure Structured Streaming to clean up old state files, reducing leftover file conflicts:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Recommended approach for production&lt;/STRONG&gt;&lt;BR /&gt;Always use a unique checkpoint location for each streaming application.&lt;BR /&gt;Clean up checkpoints from failed runs before restarting.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Aug 2025 16:00:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128248#M48187</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-08-12T16:00:09Z</dc:date>
    </item>
    <item>
      <title>Re: Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128262#M48192</link>
      <description>&lt;P&gt;Thanks for your reply. I couldn't find this parameter in the public docs. Can you point me to where it is documented?&lt;/P&gt;&lt;P&gt;spark.conf.set("spark.sql.streaming.stateStore.rocksdb.cleanupCheckpoint", "true")&lt;/P&gt;</description>
      <pubDate>Tue, 12 Aug 2025 16:49:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128262#M48192</guid>
      <dc:creator>susmitsircar</dc:creator>
      <dc:date>2025-08-12T16:49:43Z</dc:date>
    </item>
    <item>
      <title>Re: Spark streaming failing intermittently with FileAlreadyExistsException RocksDB checkpointing</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128269#M48193</link>
      <description>&lt;P&gt;Sorry, I stated that incorrectly. Please check whether the setting below works instead.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.conf.set("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")&lt;/LI-CODE&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/rocksdb-state-store?utm_source=chatgpt.com" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/rocksdb-state-store?utm_source=chatgpt.com&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 12 Aug 2025 17:33:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-failing-intermittently-with/m-p/128269#M48193</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-08-12T17:33:07Z</dc:date>
    </item>
  </channel>
</rss>

