<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Autoloader Checkpoint Issue in Data Engineering - Databricks Community</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/125921#M47582</link>
    <description>Thread from the Databricks Community Data Engineering forum: finding and resetting the default Autoloader checkpoint location after bad source files caused load failures.</description>
    <pubDate>Tue, 22 Jul 2025 03:53:50 GMT</pubDate>
    <dc:creator>lingareddy_Alva</dc:creator>
    <dc:date>2025-07-22T03:53:50Z</dc:date>
    <item>
      <title>Autoloader Checkpoint Issue</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/125910#M47576</link>
      <description>&lt;P&gt;I was pulling data from an S3 source using a Databricks Autoloader pipeline. Some files in the source contained bad characters, which caused the Autoloader to fail to load the data. These problematic files have now been removed from the source, but Databricks continues to complain about them. I want to reset the checkpoint to an earlier date to reprocess the data, but I didn’t explicitly specify a checkpoint location in the Autoloader configuration. Where is the default checkpoint location stored, and how can I reset it to a previous date?&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 00:23:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/125910#M47576</guid>
      <dc:creator>databricks_use2</dc:creator>
      <dc:date>2025-07-22T00:23:56Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader Checkpoint Issue</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/125921#M47582</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175432"&gt;@databricks_use2&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Default Checkpoint Location&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;When you don't specify a checkpoint location, Autoloader stores its checkpoint in:&lt;BR /&gt;/tmp/checkpoints/&amp;lt;stream-id&amp;gt;/&lt;/P&gt;&lt;P&gt;The &amp;lt;stream-id&amp;gt; is auto-generated from your stream configuration.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Finding Your Checkpoint&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 1: Check the Spark UI&lt;/STRONG&gt;&lt;BR /&gt;Open your streaming query's details in the Spark UI; the checkpoint location is shown in the query information.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 2: List the tmp checkpoints&lt;/STRONG&gt;&lt;BR /&gt;dbutils.fs.ls("/tmp/checkpoints/")&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Resetting the Checkpoint&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 1: Remove specific entries (most targeted)&lt;/STRONG&gt;&lt;BR /&gt;# Navigate to your checkpoint directory&lt;BR /&gt;checkpoint_path = "/tmp/checkpoints/&amp;lt;your-stream-id&amp;gt;/"&lt;BR /&gt;# Remove the problematic file entries from the offset log.&lt;BR /&gt;# This requires careful manual editing of the offset files.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 2: Reset to an earlier offset&lt;/STRONG&gt;&lt;BR /&gt;# Stop your stream first&lt;BR /&gt;query.stop()&lt;BR /&gt;# Then remove entries after a specific date from the offset log.&lt;BR /&gt;# (Complex: requires parsing the JSON offset files.)&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 3: Fresh start (simplest)&lt;/STRONG&gt;&lt;BR /&gt;# Delete the entire checkpoint and restart&lt;BR /&gt;dbutils.fs.rm("/tmp/checkpoints/&amp;lt;your-stream-id&amp;gt;/", True)&lt;BR /&gt;# Restart Autoloader. Note that checkpointLocation is a write-side option,&lt;BR /&gt;# so it belongs on writeStream, not readStream:&lt;BR /&gt;df = spark.readStream.format("cloudFiles") \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("cloudFiles.format", "json") \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.load("s3://your-bucket/path/")&lt;BR /&gt;query = df.writeStream \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("checkpointLocation", "/path/to/new/checkpoint") \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.toTable("your_target_table")&amp;nbsp;&amp;nbsp;# placeholder target&lt;/P&gt;&lt;P&gt;Keep in mind that with a brand-new checkpoint, Autoloader treats every file in the source path as unseen, so existing data will be reprocessed.&lt;/P&gt;&lt;P&gt;Recommendation: use Option 3 with an explicit checkpoint location for future runs to avoid this issue.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Jul 2025 03:53:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/125921#M47582</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-07-22T03:53:50Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader Checkpoint Issue</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/128283#M48199</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175432"&gt;@databricks_use2&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If this resolved your issue, please mark it as the accepted solution so it can help others.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Aug 2025 22:20:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-checkpoint-issue/m-p/128283#M48199</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-08-12T22:20:48Z</dc:date>
    </item>
  </channel>
</rss>