Data Engineering

Autoloader Checkpoint Issue

databricks_use2
New Contributor II

I was pulling data from an S3 source using a Databricks Autoloader pipeline. Some files in the source contained bad characters, which caused the Autoloader to fail to load the data. These problematic files have now been removed from the source, but Databricks continues to complain about them. I want to reset the checkpoint to an earlier date to reprocess the data, but I didn't explicitly specify a checkpoint location in the Autoloader configuration. Where is the default checkpoint location stored, and how can I reset it to a previous date?

1 ACCEPTED SOLUTION


lingareddy_Alva
Honored Contributor III

Hi @databricks_use2 

Default Checkpoint Location

When you don't specify a checkpoint location, Databricks creates a temporary checkpoint for the stream, typically under:
/tmp/checkpoints/<stream-id>/

The <stream-id> is auto-generated from your stream configuration. Temporary checkpoints may not be retained across cluster restarts, which is one more reason to set an explicit location going forward.


Finding Your Checkpoint
Option 1: Check Spark UI
Look at your streaming query details in the Spark UI
The checkpoint location will be displayed in the query information

Option 2: List tmp checkpoints
dbutils.fs.ls("/tmp/checkpoints/")
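Once you have the candidate directories listed, you can confirm which one belongs to your stream: a Structured Streaming checkpoint directory contains a `metadata` file holding the query id as JSON, which you can compare against `query.id` from your running stream. A minimal sketch in plain Python (the helper name and paths are illustrative, not a Databricks API):

```python
import json
from pathlib import Path
from typing import Optional

def find_checkpoint_for_query(root: str, query_id: str) -> Optional[str]:
    """Scan checkpoint directories under `root` and return the one whose
    `metadata` file records the given streaming query id."""
    for metadata in Path(root).glob("*/metadata"):
        try:
            # Structured Streaming writes {"id": "<uuid>"} to this file.
            if json.loads(metadata.read_text()).get("id") == query_id:
                return str(metadata.parent)
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or unrelated directories
    return None
```

On Databricks you would point `root` at the DBFS fuse path (e.g. `/dbfs/tmp/checkpoints/`) so ordinary file I/O works.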

Resetting the Checkpoint
Option 1: Remove specific entries (most surgical)
# Locate your checkpoint directory
checkpoint_path = "/tmp/checkpoints/<your-stream-id>/"

# Remove the problematic file entries from the offset log.
# This requires careful manual editing of the offset files,
# so back up the checkpoint directory first.

Option 2: Reset to an earlier offset
# Stop your stream first
query.stop()

# Then remove the offset-log entries written after a specific date
# (complex: the offset files are JSON and must be parsed)
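Conceptually, rolling a stream back means deleting the offset/commit log entries above a chosen batch id; on restart, the stream resumes from the last remaining offset. A minimal sketch of selecting which log files to remove, assuming the standard Structured Streaming layout where files under `offsets/` and `commits/` are named by batch id (the function is illustrative; run it against a copy of the checkpoint first). One caveat: Auto Loader also keeps its own record of already-ingested files (a RocksDB store under `sources/` in the checkpoint), so trimming the offset log alone may not make previously seen files reprocess.

```python
from pathlib import Path
from typing import List

def batches_to_delete(log_dir: str, keep_up_to: int) -> List[Path]:
    """Offset/commit log files are named by batch id ("0", "1", ...).
    Return the files whose batch id exceeds `keep_up_to`, i.e. the
    entries to remove so the stream restarts just after that batch."""
    doomed = []
    for f in Path(log_dir).iterdir():
        # Strip the ".crc" suffix and leading dot of checksum siblings
        name = f.name.removesuffix(".crc").lstrip(".")
        if name.isdigit() and int(name) > keep_up_to:
            doomed.append((int(name), f))
    return [f for _, f in sorted(doomed, key=lambda t: t[0])]
```

Apply it to both the `offsets/` and `commits/` subdirectories of the checkpoint, with the query stopped.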

Option 3: Fresh start (simplest)
# Stop the stream, then delete the entire checkpoint.
# Note: a fresh checkpoint will re-ingest everything currently in the source path.
dbutils.fs.rm("/tmp/checkpoints/<your-stream-id>/", True)

# Restart Auto Loader with an explicit checkpoint location.
# checkpointLocation is a writeStream option, and Auto Loader needs
# cloudFiles.schemaLocation to track the inferred schema.
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/path/to/schema")
      .load("s3://your-bucket/path/"))

(df.writeStream
   .option("checkpointLocation", "/path/to/new/checkpoint")
   .toTable("your_target_table"))

Recommendation: Use Option 3 with an explicit checkpoint location for future runs to avoid this issue.

 

LR


2 REPLIES


lingareddy_Alva
Honored Contributor III

Hello @databricks_use2 

If you are okay with this, please mark it as the solution so it can help others.

LR
