<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark streaming: Checkpoint not recognising new data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12496#M7296</link>
    <description>&lt;P&gt;Also, the trigger is configured to run once, but when we start the job it never ends; it stays in an endless loop.&lt;/P&gt;</description>
    <pubDate>Tue, 26 Jul 2022 13:15:11 GMT</pubDate>
    <dc:creator>mriccardi</dc:creator>
    <dc:date>2022-07-26T13:15:11Z</dc:date>
    <item>
      <title>Spark streaming: Checkpoint not recognising new data</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12495#M7295</link>
      <description>&lt;P&gt;Hello everyone!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We are currently facing an issue with a stream that has not picked up new data since 20 July.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We've validated that the bronze table has data the silver table doesn't.&lt;/P&gt;&lt;P&gt;Looking at the logs, the silver stream is running but writing 0 files.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;OffsetSeqLog: BatchIds found from listing: 0
22/07/26 12:45:14 INFO OffsetSeqLog: Getting latest offset 0
22/07/26 12:45:14 INFO CommitLog: BatchIds found from listing: 0
22/07/26 12:45:14 INFO CommitLog: Getting latest offset 0
22/07/26 12:45:14 INFO MicroBatchExecution: Query start: last started microbatch offset info = Some((0,[{"sourceVersion":1,"reservoirId":"271090ee-5d4b-4087-a6a0-5a9760d969d8","reservoirVersion":916406,"index":-1,"isStartingVersion":false}])), last successfully finished microbatch offset info = Some((0,CommitMetadata(0)))
22/07/26 12:45:14 INFO OffsetSeqLog: BatchIds found from listing: 0
22/07/26 12:45:14 INFO OffsetSeqLog: Getting latest offset 0
22/07/26 12:45:15 INFO CommitLog: BatchIds found from listing: 0
22/07/26 12:45:15 INFO CommitLog: Getting latest offset 0&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Basically, the job reads the bronze table, applies some transformations, and writes to our silver path.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;bronze_df = spark \
    .readStream \
    .format("delta") \
    .load(str(INPUT_PATH))

df = transform(bronze_df)

pc_df = df \
    .writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .format("delta") \
    .option("checkpointLocation", CHECKPOINT_PATH) \
    .partitionBy("event_date", "event_hour", "ad_type") \
    .queryName("prod_silver_v2") \
    .start(OUTPUT_PATH)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;To add: last week we reprocessed this table because we are repartitioning it. On the first run (8 hrs) the final step was to optimize the silver table, and the job failed on that step. After that we saw that the table had the expected data, but since that run we haven't been able to update the table any more.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could this be related?&lt;/P&gt;&lt;P&gt;Is there any way to "recover" the checkpoint to a previous state?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 13:10:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12495#M7295</guid>
      <dc:creator>mriccardi</dc:creator>
      <dc:date>2022-07-26T13:10:34Z</dc:date>
    </item>
    <item>
      <title>Re: Spark streaming: Checkpoint not recognising new data</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12496#M7296</link>
      <description>&lt;P&gt;Also, the trigger is configured to run once, but when we start the job it never ends; it stays in an endless loop.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 13:15:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12496#M7296</guid>
      <dc:creator>mriccardi</dc:creator>
      <dc:date>2022-07-26T13:15:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark streaming: Checkpoint not recognising new data</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12497#M7297</link>
      <description>&lt;P&gt;Did you delete the checkpoint by mistake? If you did, you can use "startingVersion" to define the offset version you would like to start reading from. More documentation here: &lt;A href="https://docs.databricks.com/delta/delta-streaming.html#specify-initial-position" target="test_blank"&gt;https://docs.databricks.com/delta/delta-streaming.html#specify-initial-position&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2022 23:40:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12497#M7297</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-07-29T23:40:21Z</dc:date>
    </item>
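The suggestion above can be sketched as follows. This is a minimal, hypothetical example only: it assumes the old checkpoint is unusable and that you restart the Delta source read from an explicit table version against a fresh checkpoint. The paths, the `NEW_CHECKPOINT_PATH` name, and the version number are placeholders, not values confirmed in this thread.

```python
# Sketch only: resume a Delta source stream from an explicit table version
# after losing (or abandoning) the old checkpoint. All paths and the
# version number are placeholders.
restart_df = (
    spark.readStream
    .format("delta")
    # "startingVersion" tells the Delta source which table commit to start
    # reading from on the FIRST run against a fresh checkpoint; subsequent
    # runs use the offsets recorded in that checkpoint instead.
    .option("startingVersion", 916406)
    .load(str(INPUT_PATH))
)

query = (
    restart_df.writeStream
    .outputMode("append")
    .trigger(once=True)
    .format("delta")
    # Point at a NEW checkpoint directory: if the old one is reused, the
    # stream replays its recorded offsets and ignores startingVersion.
    .option("checkpointLocation", NEW_CHECKPOINT_PATH)
    .start(OUTPUT_PATH)
)
```

Note the design caveat: because `startingVersion` only applies when the checkpoint has no recorded offsets, reprocessing from an earlier version always implies starting a new checkpoint and accepting possible duplicates downstream.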
    <item>
      <title>Re: Spark streaming: Checkpoint not recognising new data</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12498#M7298</link>
      <description>&lt;P&gt;Hi @Martin Riccardi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just a friendly follow-up. Did you see my previous response? Did it help? Please let us know.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Aug 2022 23:09:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/12498#M7298</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-08-15T23:09:17Z</dc:date>
    </item>
    <item>
      <title>Re: Spark streaming: Checkpoint not recognising new data</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/54996#M30214</link>
      <description>&lt;P&gt;But how can one assign startingVersion in production? It should pick up the data from the point where the job failed.&lt;/P&gt;&lt;P&gt;I am encountering a similar issue: the checkpoint location is consistently updated with new offset values every 15 minutes, yet the streaming data fails to load into the Delta table, although I can see that the stream is running. I have attempted startingOffset=-1/@latest, but unfortunately none of these approaches seems to resolve the issue.&lt;/P&gt;</description>
      <pubDate>Sun, 10 Dec 2023 15:17:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-checkpoint-not-recognising-new-data/m-p/54996#M30214</guid>
      <dc:creator>Himanshu16</dc:creator>
      <dc:date>2023-12-10T15:17:51Z</dc:date>
    </item>
  </channel>
</rss>

