Hi @databricks_use2
Default Checkpoint Location
When you don't specify a checkpoint location, Auto Loader stores checkpoints in:
/tmp/checkpoints/<stream-id>/
The <stream-id> is auto-generated based on your stream configuration.
Finding Your Checkpoint
Option 1: Check Spark UI
Look at your streaming query's details in the Spark UI; the checkpoint location is displayed in the query information.
Option 2: List tmp checkpoints
dbutils.fs.ls("/tmp/checkpoints/")
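If several streams have written under /tmp/checkpoints/, the one you want is usually the most recently modified directory. A minimal sketch of that selection logic, with the listing modeled as (path, modificationTime) tuples like the fields dbutils.fs.ls returns (the paths and timestamps below are synthetic examples):

```python
# Hypothetical helper: pick the most recently modified checkpoint directory
# from a listing such as dbutils.fs.ls("/tmp/checkpoints/") would return.
# Each entry is modeled here as a (path, modificationTime_ms) tuple.

def newest_checkpoint(entries):
    """Return the path with the largest modification time, or None if empty."""
    if not entries:
        return None
    return max(entries, key=lambda e: e[1])[0]

# Example with synthetic listing data:
listing = [
    ("/tmp/checkpoints/stream-a/", 1700000000000),
    ("/tmp/checkpoints/stream-b/", 1700000500000),
]
print(newest_checkpoint(listing))  # /tmp/checkpoints/stream-b/
```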
Resetting the Checkpoint
Option 1: Delete specific files (safest)
# Navigate to your checkpoint directory
checkpoint_path = "/tmp/checkpoints/<your-stream-id>/"
# Remove the problematic entries from the offset log
# This requires careful manual editing of the offset files under <checkpoint>/offsets/
Option 2: Reset to earlier offset
# Stop your stream first
query.stop()
# Remove files after a specific date from the offset log
# (Complex - requires parsing JSON offset files)
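To make the "roll back the offset log" idea concrete, here is a minimal sketch of deciding which offset files to remove. Structured Streaming names the files under <checkpoint>/offsets/ by batch id ("0", "1", "2", ...); the batch ids and cutoff below are illustrative, and the stream must be stopped before any files are deleted:

```python
# Sketch: given the file names in <checkpoint>/offsets/, return the ones
# whose batch id is beyond the batch you want to keep. Non-numeric names
# (e.g. compaction files like "9.compact") are left untouched.

def batches_to_delete(offset_file_names, keep_up_to):
    """Return offset file names whose batch id is greater than keep_up_to."""
    doomed = []
    for name in offset_file_names:
        try:
            batch_id = int(name)
        except ValueError:
            continue  # skip non-batch files such as ".compact" entries
        if batch_id > keep_up_to:
            doomed.append(name)
    return sorted(doomed, key=int)

print(batches_to_delete(["0", "1", "2", "3"], keep_up_to=1))  # ['2', '3']
```

The matching files in <checkpoint>/commits/ would need the same trimming, which is why a fresh start (Option 3) is usually less error-prone.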
Option 3: Fresh start (simplest)
# Delete entire checkpoint and restart
dbutils.fs.rm("/tmp/checkpoints/<your-stream-id>/", True)
# Restart Auto Loader with an explicit checkpoint location.
# Note: checkpointLocation is a writeStream option, not a readStream option.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/path/to/new/checkpoint/schema") \
    .load("s3://your-bucket/path/")

query = df.writeStream \
    .option("checkpointLocation", "/path/to/new/checkpoint") \
    .toTable("your_target_table")  # replace with your actual target table
Recommendation: Use Option 3 with an explicit checkpoint location for future runs to avoid this issue.
LR