<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark Streaming - Checkpoint State EOF Exception in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30777#M22344</link>
    <description>&lt;P&gt;Hi @Jose Gonzalez​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for the reply. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Write stream is as follows&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dataset.writeStream
                 .format("delta")
                 .outputMode("append")
                 .option("checkpointLocation", "dbfs:/checkpoints_v1/&amp;lt;table_name&amp;gt;")
                 .option("mergeSchema", "true")
                 .table("&amp;lt;table_name&amp;gt;")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The Dataset is created after fetching records from delta tables in a stream and applying the flatMapGroupsWithState API on the records.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Few Observations:&lt;/P&gt;&lt;P&gt;1) The probability of the error occurring is more at the higher loads. If the input rate for the flatMapGroupsWithState API is 12000 records/second, then then the error occurs regularly and generally within 2 hours of the start of the job. For lower loads of 4000 records / seconds, the error occurs infrequently .&lt;/P&gt;&lt;P&gt;2) The size of the snapshot file for which the error occurs is 0 bytes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let me know if you require any other information.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Rohan&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 09 Feb 2022 04:45:03 GMT</pubDate>
    <dc:creator>RohanB</dc:creator>
    <dc:date>2022-02-09T04:45:03Z</dc:date>
    <item>
      <title>Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30773#M22340</link>
      <description>&lt;P&gt;I have a Spark Structured Streaming job which reads from 2 Delta tables in streams , processes the data and then writes to a 3rd Delta table. The job is being run with the Databricks service on GCP.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Sometimes the job fails with the following exception. The exception is intermittent and so far I have not been able to come up with the steps to reproduce the issue. What can be the cause for this exception?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Caused by: java.io.IOException: Error reading streaming state file dbfs:/checkpoint_loc/state/0/183/26.snapshot of HDFSStateStoreProvider[id = (op=0,part=183),dir = dbfs:/checkpoint_loc/state/0/183]: premature EOF reached&lt;/B&gt;&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.readSnapshotFile(HDFSBackedStateStoreProvider.scala:642)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.$anonfun$loadMap$3(HDFSBackedStateStoreProvider.scala:437)&lt;/P&gt;&lt;P&gt;	at scala.Option.orElse(Option.scala:447)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.$anonfun$loadMap$2(HDFSBackedStateStoreProvider.scala:437)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:642)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.loadMap(HDFSBackedStateStoreProvider.scala:417)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getLoadedMapForStore(HDFSBackedStateStoreProvider.scala:245)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:229)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:500)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:125)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)&lt;/P&gt;&lt;P&gt;	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)&lt;/P&gt;&lt;P&gt;	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)&lt;/P&gt;&lt;P&gt;	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.scheduler.Task.run(Task.scala:91)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)&lt;/P&gt;&lt;P&gt;	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)&lt;/P&gt;&lt;P&gt;	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt;	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)&lt;/P&gt;&lt;P&gt;	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)&lt;/P&gt;&lt;P&gt;	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)&lt;/P&gt;&lt;P&gt;	at java.lang.Thread.run(Thread.java:748)&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jan 2022 10:39:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30773#M22340</guid>
      <dc:creator>RohanB</dc:creator>
      <dc:date>2022-01-27T10:39:37Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30774#M22341</link>
      <description>&lt;P&gt;Hello, @Rohan Bawdekar​&amp;nbsp;- My name is Piper, and I'm a moderator for the Databricks community. Welcome to the community and thank you for asking!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let's give it a while longer to give the members of the community a chance to respond before we come back to this. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 27 Jan 2022 22:45:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30774#M22341</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-01-27T22:45:59Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30775#M22342</link>
      <description>&lt;P&gt;@Rohan Bawdekar​&amp;nbsp;-You are not forgotten. We are looking for someone to help you.&lt;/P&gt;</description>
      <pubDate>Mon, 07 Feb 2022 16:26:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30775#M22342</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-07T16:26:53Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30776#M22343</link>
      <description>&lt;P&gt;Hi @Rohan Bawdekar​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could you share your writeStream code? I would like to know what do you use for option("checkpointLocation", "&amp;lt;storage_location&amp;gt;")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;--Jose&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Feb 2022 00:31:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30776#M22343</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-02-09T00:31:32Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30777#M22344</link>
      <description>&lt;P&gt;Hi @Jose Gonzalez​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for the reply. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Write stream is as follows&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dataset.writeStream
                 .format("delta")
                 .outputMode("append")
                 .option("checkpointLocation", "dbfs:/checkpoints_v1/&amp;lt;table_name&amp;gt;")
                 .option("mergeSchema", "true")
                 .table("&amp;lt;table_name&amp;gt;")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The Dataset is created after fetching records from delta tables in a stream and applying the flatMapGroupsWithState API on the records.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Few Observations:&lt;/P&gt;&lt;P&gt;1) The probability of the error occurring is more at the higher loads. If the input rate for the flatMapGroupsWithState API is 12000 records/second, then then the error occurs regularly and generally within 2 hours of the start of the job. For lower loads of 4000 records / seconds, the error occurs infrequently .&lt;/P&gt;&lt;P&gt;2) The size of the snapshot file for which the error occurs is 0 bytes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let me know if you require any other information.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks &amp;amp; Regards,&lt;/P&gt;&lt;P&gt;Rohan&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 09 Feb 2022 04:45:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30777#M22344</guid>
      <dc:creator>RohanB</dc:creator>
      <dc:date>2022-02-09T04:45:03Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30778#M22345</link>
      <description>&lt;P&gt;Hi @Jose Gonzalez​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Do you require any more information regarding the code? Any idea what could be cause for the issue?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Rohan&lt;/P&gt;</description>
      <pubDate>Tue, 15 Feb 2022 12:27:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30778#M22345</guid>
      <dc:creator>RohanB</dc:creator>
      <dc:date>2022-02-15T12:27:03Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30779#M22346</link>
      <description>&lt;P&gt;Hi @Rohan Bawdekar​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You are using DBFS mont point in your checkpoint, what is this reference to? storage?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you,&lt;/P&gt;&lt;P&gt;--Jose&lt;/P&gt;</description>
      <pubDate>Wed, 16 Feb 2022 19:04:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30779#M22346</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-02-16T19:04:49Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30780#M22347</link>
      <description>&lt;P&gt;Hi @Jose Gonzalez​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Yes, it is referring to the GCP storage.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Rohan&lt;/P&gt;</description>
      <pubDate>Thu, 17 Feb 2022 05:37:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30780#M22347</guid>
      <dc:creator>RohanB</dc:creator>
      <dc:date>2022-02-17T05:37:28Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming - Checkpoint State EOF Exception</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30781#M22348</link>
      <description>&lt;P&gt;You are using open source HDFS state store provider. There is an issue when trying to read your checkpoint. I will highly recommend to use RocksDB state store &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/production.html#configure-rocksdb-state-store" target="test_blank"&gt;https://docs.databricks.com/spark/latest/structured-streaming/production.html#configure-rocksdb-state-store&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Streaming state  store providers: &lt;A href="https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state" target="test_blank"&gt;https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2022 23:22:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-state-eof-exception/m-p/30781#M22348</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-03-07T23:22:54Z</dc:date>
    </item>
  </channel>
</rss>

