Data Engineering

Autoloader Checkpoint Issue

databricks_use2
New Contributor II

I was pulling data from an S3 source using a Databricks Autoloader pipeline. Some files in the source contained bad characters, which caused the Autoloader to fail to load the data. These problematic files have now been removed from the source, but Databricks continues to complain about them. I want to reset the checkpoint to an earlier date to reprocess the data, but I didn't explicitly specify a checkpoint location in the Autoloader configuration. Where is the default checkpoint location stored, and how can I reset it to a previous date?

1 ACCEPTED SOLUTION


lingareddy_Alva
Honored Contributor III

Hi @databricks_use2 

Default Checkpoint Location

When you don't specify a checkpoint location, Databricks creates a temporary checkpoint for the stream, typically under:
/tmp/checkpoints/<stream-id>/

The <stream-id> is auto-generated from your stream configuration. Temporary checkpoints may not be retained across cluster restarts, which is one more reason to set an explicit location going forward.


Finding Your Checkpoint
Option 1: Check Spark UI
Look at your streaming query details in the Spark UI
The checkpoint location will be displayed in the query information

Option 2: List tmp checkpoints
dbutils.fs.ls("/tmp/checkpoints/")
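Once you have the candidate directories listed, you can confirm which one belongs to your stream: a Structured Streaming checkpoint directory contains a `metadata` file holding the query id as JSON, which you can compare against `query.id` from your running stream. A minimal sketch in plain Python (the helper name and paths are illustrative, not a Databricks API):

```python
import json
from pathlib import Path
from typing import Optional

def find_checkpoint_for_query(root: str, query_id: str) -> Optional[str]:
    """Scan checkpoint directories under `root` and return the one whose
    `metadata` file records the given streaming query id."""
    for metadata in Path(root).glob("*/metadata"):
        try:
            # Structured Streaming writes {"id": "<uuid>"} to this file.
            if json.loads(metadata.read_text()).get("id") == query_id:
                return str(metadata.parent)
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or unrelated directories
    return None
```

On Databricks you would point `root` at the DBFS fuse path (e.g. `/dbfs/tmp/checkpoints/`) so ordinary file I/O works.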

Resetting the Checkpoint
Option 1: Remove specific entries (most surgical)
# Locate your checkpoint directory
checkpoint_path = "/tmp/checkpoints/<your-stream-id>/"

# Remove the problematic file entries from the offset log.
# This requires careful manual editing of the offset files,
# so back up the checkpoint directory first.

Option 2: Reset to an earlier offset
# Stop your stream first
query.stop()

# Then remove the offset-log entries written after a specific date
# (complex: the offset files are JSON and must be parsed)
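Conceptually, rolling a stream back means deleting the offset/commit log entries above a chosen batch id; on restart, the stream resumes from the last remaining offset. A minimal sketch of selecting which log files to remove, assuming the standard Structured Streaming layout where files under `offsets/` and `commits/` are named by batch id (the function is illustrative; run it against a copy of the checkpoint first). One caveat: Auto Loader also keeps its own record of already-ingested files (a RocksDB store under `sources/` in the checkpoint), so trimming the offset log alone may not make previously seen files reprocess.

```python
from pathlib import Path
from typing import List

def batches_to_delete(log_dir: str, keep_up_to: int) -> List[Path]:
    """Offset/commit log files are named by batch id ("0", "1", ...).
    Return the files whose batch id exceeds `keep_up_to`, i.e. the
    entries to remove so the stream restarts just after that batch."""
    doomed = []
    for f in Path(log_dir).iterdir():
        # Strip the ".crc" suffix and leading dot of checksum siblings
        name = f.name.removesuffix(".crc").lstrip(".")
        if name.isdigit() and int(name) > keep_up_to:
            doomed.append((int(name), f))
    return [f for _, f in sorted(doomed, key=lambda t: t[0])]
```

Apply it to both the `offsets/` and `commits/` subdirectories of the checkpoint, with the query stopped.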

Option 3: Fresh start (simplest)
# Stop the stream, then delete the entire checkpoint.
# Note: a fresh checkpoint will re-ingest everything currently in the source path.
dbutils.fs.rm("/tmp/checkpoints/<your-stream-id>/", True)

# Restart Auto Loader with an explicit checkpoint location.
# checkpointLocation is a writeStream option, and Auto Loader needs
# cloudFiles.schemaLocation to track the inferred schema.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/path/to/schema")
      .load("s3://your-bucket/path/"))

(df.writeStream
   .option("checkpointLocation", "/path/to/new/checkpoint")
   .toTable("your_target_table"))

Recommendation: Use Option 3 with an explicit checkpoint location for future runs to avoid this issue.

 

LR


2 REPLIES


lingareddy_Alva
Honored Contributor III

Hello @databricks_use2 

If you are okay with this, please mark it as the solution so it can help others.

LR
