lingareddy_Alva
Esteemed Contributor

Hi @databricks_use2 

Default Checkpoint Location

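When you don't specify a checkpoint location, the checkpoint for an Auto Loader stream may end up under a temporary path such as:
/tmp/checkpoints/<stream-id>/

The <stream-id> part is auto-generated. The exact location depends on your workspace and on how the stream was started, so confirm it with one of the options below before changing anything.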


Finding Your Checkpoint
Option 1: Check Spark UI
Look at your streaming query's details in the Spark UI.
The checkpoint location should be shown in the query information; if it is not, the driver log also records the checkpoint path when the query starts.

Option 2: List tmp checkpoints
dbutils.fs.ls("/tmp/checkpoints/")
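If the listing returns several directories, one way to tell which belongs to your stream is to compare ids. Structured Streaming writes a small `metadata` file at the root of each checkpoint containing the query id as JSON, and that id matches `query.id` across restarts. The sketch below is a minimal illustration of that idea; the helper names `checkpoint_query_id` and `find_checkpoints` are made up for this example, and the file reader is passed in as a parameter so the logic does not depend on any particular file system.

```python
import json

def checkpoint_query_id(metadata_text):
    """Return the query id stored in a checkpoint's `metadata` file, or None."""
    try:
        return json.loads(metadata_text).get("id")
    except (ValueError, TypeError, AttributeError):
        return None  # not JSON, or not a JSON object

def find_checkpoints(paths, read_text, query_id):
    """Return the candidate paths whose `metadata` file carries `query_id`.

    `read_text` is any callable that takes a file path and returns its text.
    """
    matches = []
    for path in paths:
        try:
            text = read_text(path.rstrip("/") + "/metadata")
        except Exception:
            continue  # not a checkpoint directory, or not readable
        if checkpoint_query_id(text) == query_id:
            matches.append(path)
    return matches

# Assumed usage in a Databricks notebook (not executed here):
# paths = [f.path for f in dbutils.fs.ls("/tmp/checkpoints/")]
# find_checkpoints(paths, dbutils.fs.head, str(query.id))
```

This only narrows down candidates; it is still worth opening the matching directory and checking its contents before deleting anything.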

Resetting the Checkpoint
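Option 1: Delete specific files (most targeted, but the hardest to do safely)
# Navigate to your checkpoint directory
checkpoint_path = "/tmp/checkpoints/<your-stream-id>/"

# Remove the problematic file entries from the checkpoint state
# Auto Loader tracks discovered files in a RocksDB store inside the checkpoint,
# so this means careful manual editing; back up the directory first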

Option 2: Reset to earlier offset
# Stop your stream first
query.stop()

# Remove the entries written after a specific date from the offset log
# (complex: requires parsing the JSON offset files under <checkpoint>/offsets/)
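To give a sense of what parsing the offset files involves, here is a read-only sketch. It assumes the layout that open-source Structured Streaming uses for files under `offsets/`: a version line such as `v1`, one line of JSON batch metadata, then one line per source (or `-` when a source has no offset). The per-source payload that Auto Loader writes is not covered here, so the sample values in the usage comment are placeholders, and `parse_offset_file` is a name invented for this example.

```python
import json

def parse_offset_file(text):
    """Split an offset log file into (version, batch_metadata, source_offsets)."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("v"):
        raise ValueError("not a Structured Streaming offset file")
    version = lines[0].strip()
    metadata = {}
    if len(lines) > 1 and lines[1].strip():
        metadata = json.loads(lines[1])
    offsets = []
    for line in lines[2:]:
        line = line.strip()
        if not line:
            continue
        if line == "-":
            offsets.append(None)  # source had no offset for this batch
            continue
        try:
            offsets.append(json.loads(line))
        except ValueError:
            offsets.append(line)  # keep non-JSON offsets as raw text
    return version, metadata, offsets

# Assumed usage (the batch number 42 is hypothetical):
# text = dbutils.fs.head(checkpoint_path + "offsets/42")
# version, meta, offsets = parse_offset_file(text)
# meta.get("batchTimestampMs") tells you when the batch was planned
```

Inspecting the files this way is low risk because nothing is modified. Rewriting or deleting them is a different matter and should only be tried on a copy of the checkpoint.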

Option 3: Fresh start (simplest)
# Delete entire checkpoint and restart
dbutils.fs.rm("/tmp/checkpoints/<your-stream-id>/", True)

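# Restart Auto Loader with an explicit checkpoint location
# checkpointLocation is a writeStream option; schemaLocation is needed when the schema is inferred
df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.option("cloudFiles.schemaLocation", "/path/to/new/checkpoint") \
.load("s3://your-bucket/path/")

query = df.writeStream \
.option("checkpointLocation", "/path/to/new/checkpoint") \
.toTable("<your-target-table>")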
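Because the delete in Option 3 is recursive, a small guard around it can help avoid removing the wrong directory. The sketch below is one possible wrapper, not an official utility: `reset_checkpoint` and its `min_depth` parameter are hypothetical, and the delete function is passed in (for example `dbutils.fs.rm`) so the guard logic stands on its own.

```python
def reset_checkpoint(path, remove, query=None, min_depth=3):
    """Stop `query` (if given), then recursively delete `path`.

    `remove` is a callable with the shape of dbutils.fs.rm(path, recurse).
    Raises ValueError for paths that still contain a placeholder or that
    are too close to the root to be a single stream's checkpoint.
    """
    if "<" in path or ">" in path:
        raise ValueError(f"placeholder left in path: {path!r}")
    # Count real path segments, ignoring empty parts and a scheme like "s3:"
    parts = [p for p in path.split("/") if p and not p.endswith(":")]
    if len(parts) < min_depth:
        raise ValueError(f"refusing to delete shallow path: {path!r}")
    if query is not None:
        query.stop()  # stop the stream before touching its checkpoint
    return remove(path, True)

# Assumed usage in a Databricks notebook (not executed here):
# reset_checkpoint("/tmp/checkpoints/<id you verified>/", dbutils.fs.rm, query)
```

After a reset, the stream starts from scratch and will pick up every file in the source path again, so make sure the target can tolerate reprocessing.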

Recommendation: Use Option 3 with an explicit checkpoint location for future runs to avoid this issue.

LR
