<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Reset committed offset of spark streaming to capture missed data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140857#M51548</link>
    <description>&lt;P&gt;Thank you K_Anudeep! The REST API is exactly what I was looking for.&lt;/P&gt;</description>
    <pubDate>Tue, 02 Dec 2025 10:52:42 GMT</pubDate>
    <dc:creator>DatabricksUser5</dc:creator>
    <dc:date>2025-12-02T10:52:42Z</dc:date>
    <item>
      <title>Reset committed offset of spark streaming to capture missed data</title>
      <link>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140613#M51486</link>
      <description>&lt;P&gt;I have a very straightforward setup between Azure Event Hubs and DLT, using the Kafka endpoint through Spark Structured Streaming.&lt;/P&gt;&lt;P&gt;There were network issues and the stream didn't pick up some events, but it still advanced (and committed) the offset for some reason.&lt;/P&gt;&lt;P&gt;As a result, the DLT pipeline now picks up any new data coming into the event hub, but not the events that arrived before the network issue was resolved.&lt;/P&gt;&lt;P&gt;Is there a way to force the Spark reader's offset to reset to earliest? At the moment, setting the desired starting offset does not work because a committed offset already exists and is used instead, but I want to override that.&lt;/P&gt;&lt;P&gt;The alternative would be to create a new partition and move the events that were not picked up there, or to re-ingest the events that precede the committed offset, but that's really not elegant imo.&lt;/P&gt;</description>
      <pubDate>Fri, 28 Nov 2025 13:55:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140613#M51486</guid>
      <dc:creator>DatabricksUser5</dc:creator>
      <dc:date>2025-11-28T13:55:57Z</dc:date>
    </item>
    <item>
      <title>Re: Reset committed offset of spark streaming to capture missed data</title>
      <link>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140633#M51495</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/199238"&gt;@DatabricksUser5&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You can’t override committed offsets in-place for a running DLT Kafka/Event Hubs stream. If a pipeline already has a checkpoint created, &lt;STRONG&gt;startingOffsets&lt;/STRONG&gt; is ignored. To replay data, you must reset the streaming checkpoints or create a new checkpoint using FULL REFRESH in DLT, and the events must still be retained in Event Hubs.&lt;/P&gt;
&lt;P&gt;For Kafka sources (including Event Hubs Kafka endpoint), Spark Structured Streaming:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Uses startingOffsets only when the streaming query is initially created.&lt;/LI&gt;
&lt;LI&gt;After that, it always resumes from the offsets stored in the checkpoint directory, completely ignoring &lt;STRONG&gt;startingOffsets&lt;/STRONG&gt; on restart.&lt;/LI&gt;
&lt;/UL&gt;
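&lt;P&gt;To make the precedence rule above concrete, here is a toy model in plain Python. It is not Spark code: the function name resolve_start_offsets and the offset dictionary are made up for illustration, and the only behaviour it mirrors is that a checkpoint, once present, wins over startingOffsets.&lt;/P&gt;
&lt;PRE&gt;
```python
def resolve_start_offsets(checkpoint_offsets, starting_offsets="latest"):
    """Toy model (hypothetical helper, not a Spark API).

    If a checkpoint already holds offsets, they are used and
    starting_offsets is ignored. Only a query with no checkpoint
    falls back to starting_offsets.
    """
    if checkpoint_offsets is not None:
        return checkpoint_offsets
    return starting_offsets


# First run: no checkpoint yet, so the option is honored.
first_run = resolve_start_offsets(None, "earliest")

# Restart: a checkpoint exists, so "earliest" has no effect.
restart = resolve_start_offsets({"0": 1500}, "earliest")
```
&lt;/PRE&gt;
&lt;P&gt;This is why changing the option on an existing pipeline does nothing until the checkpoint itself is reset.&lt;/P&gt;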
&lt;P&gt;You can refer to the KB for more details here: &lt;A href="https://kb.databricks.com/streaming/offset-reprocessing-issues-in-streaming-queries-with-a-kafka-source" target="_blank" rel="noopener"&gt;https://kb.databricks.com/streaming/offset-reprocessing-issues-in-streaming-queries-with-a-kafka-source&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;So, below are the mitigations for your scenario:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Full refresh for the streaming table/pipeline (resets checkpoints and data) (OR)&lt;/LI&gt;
&lt;LI&gt;Reset streaming flow checkpoints (REST API, keeps table data) using&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/ldp/updates#start-a-pipeline-update-to-clear-selective-streaming-flows-checkpoints" target="_blank" rel="noopener"&gt;https://docs.databricks.com/aws/en/ldp/updates#start-a-pipeline-update-to-clear-selective-streaming-flows-checkpoints&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
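&lt;P&gt;For the second option, a minimal sketch of the request in Python, based on my reading of the linked docs page. Please treat the reset_checkpoint_selection field and the /api/2.0/pipelines/.../updates path as things to verify against that page. The host, pipeline ID, token and flow name are placeholders, and the helper name build_reset_checkpoint_request is made up for this example. The snippet only builds the request and does not send it.&lt;/P&gt;
&lt;PRE&gt;
```python
import json
import urllib.request


def build_reset_checkpoint_request(host, pipeline_id, token, flow_names):
    """Build (but do not send) a POST that starts a pipeline update and
    asks for the checkpoints of the named streaming flows to be cleared.
    Hypothetical helper; endpoint and field name should be checked
    against the linked documentation."""
    url = f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates"
    body = json.dumps({"reset_checkpoint_selection": list(flow_names)})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, replace with your own workspace details.
req = build_reset_checkpoint_request(
    host="https://example.cloud.databricks.com",
    pipeline_id="my-pipeline-id",
    token="my-token",
    flow_names=["eventhub_bronze"],
)

# To actually send it: urllib.request.urlopen(req)
```
&lt;/PRE&gt;
&lt;P&gt;Only the listed flows should have their checkpoints cleared, and the table data is kept, so a replay into an append-only target can produce duplicates.&lt;/P&gt;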
</description>
      <pubDate>Sat, 29 Nov 2025 09:15:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140633#M51495</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-11-29T09:15:15Z</dc:date>
    </item>
    <item>
      <title>Re: Reset committed offset of spark streaming to capture missed data</title>
      <link>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140760#M51526</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/199238"&gt;@DatabricksUser5&lt;/a&gt;, also check your Event Hubs retention policy. For example, if retention is set to 7 days, events older than 7 days that you are trying to reprocess no longer exist in Event Hubs, so whichever option you choose they are gone and you would need to publish them again.&lt;BR /&gt;&lt;BR /&gt;Also, setting the starting offset to earliest or latest for Kafka is honored only on the first run with a clean checkpoint; after that, the stream always respects the checkpoint. Be careful about clearing checkpoints, otherwise you may end up with duplicates in your data if the target is append-only.&lt;/P&gt;&lt;P&gt;Br&lt;/P&gt;</description>
      <pubDate>Mon, 01 Dec 2025 15:56:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140760#M51526</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-12-01T15:56:34Z</dc:date>
    </item>
    <item>
      <title>Re: Reset committed offset of spark streaming to capture missed data</title>
      <link>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140857#M51548</link>
      <description>&lt;P&gt;Thank you K_Anudeep! The REST API is exactly what I was looking for.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Dec 2025 10:52:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140857#M51548</guid>
      <dc:creator>DatabricksUser5</dc:creator>
      <dc:date>2025-12-02T10:52:42Z</dc:date>
    </item>
    <item>
      <title>Re: Reset committed offset of spark streaming to capture missed data</title>
      <link>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140871#M51553</link>
      <description>&lt;P&gt;Unfortunately, this only works on pipelines that are not continuous, which mine is.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Dec 2025 13:29:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reset-committed-offset-of-spark-streaming-to-capture-missed-data/m-p/140871#M51553</guid>
      <dc:creator>DatabricksUser5</dc:creator>
      <dc:date>2025-12-02T13:29:59Z</dc:date>
    </item>
  </channel>
</rss>

