In Databricks Autoloader, controlling where a stream resumes after a job failure comes down to managing checkpoints and configuration options. By default, Autoloader uses checkpoints to remember where the stream last left off, so you neither miss nor reprocess data. If you instead want the stream to start from a specific timestamp, day, or week rather than resetting the whole checkpoint, here are the main approaches:
1. Using cloudFiles.startAfter Option
- The cloudFiles.startAfter option tells Autoloader to start ingesting only files whose names sort lexicographically after the specified file name.
- This isn't based on timestamps, but if your source files are named with dates or timestamps, you can leverage it.
Example:

```python
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.startAfter", "20251024")  # a file name, compared lexicographically
      .load("dbfs:/mnt/my-data/"))
```

This starts reading files whose names come after "20251024" in lexicographical order.
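Because the comparison is lexicographic rather than chronological, this only behaves intuitively when file names use a fixed-width, zero-padded date format. A quick sanity check in plain Python (no Spark needed; the name strings are made up for illustration):

```python
# Lexicographic string comparison, as used when ordering file names.
# Zero-padded, fixed-width date prefixes sort in chronological order:
assert "20251023" < "20251024" < "20251101"

# Unpadded components break the ordering: '9' sorts after '1',
# so September lands *after* October here.
assert "2025-9-01" > "2025-10-01"  # lexicographic, not chronological!
```

So a startAfter value like "20251024" is only reliable if every file in the source path follows the same zero-padded naming convention.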
2. Filtering Data by Timestamp Column
- If your data files include a timestamp field, you can add a .where() filter to the streaming DataFrame so that only records at or after a given instant are processed.
Example:

```python
from pyspark.sql.functions import col

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("dbfs:/mnt/my-data/")
      .where(col("event_time") >= "2025-10-24T00:00:00Z"))
```

Note that this still reads ALL files from the source; the filter only drops records older than the cutoff before they reach downstream processing.
3. Manually Manipulating Checkpoints
- Generally not recommended, but if you intentionally delete the old checkpoint and restart the stream with the filtering or startAfter approaches above, you can emulate starting from a chosen point.
- Caution: deleting or editing checkpoint files can cause data duplication if you are not careful.
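If you do reset a checkpoint, archiving the directory is safer than deleting it outright, since you can roll back if the restart goes wrong. A minimal sketch in plain Python (the paths are hypothetical; for DBFS locations you would adapt this to dbutils.fs operations):

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_checkpoint(checkpoint_dir: str) -> str:
    """Move an existing checkpoint directory aside instead of deleting it,
    so the stream starts fresh but the old state can still be restored."""
    src = Path(checkpoint_dir)
    if not src.exists():
        return ""  # nothing to archive; the stream will start fresh anyway
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dst = src.with_name(f"{src.name}_backup_{stamp}")
    shutil.move(str(src), str(dst))
    return str(dst)
```

After archiving, restart the same query pointing at the (now empty) checkpoint location, combined with a timestamp filter or startAfter so the fresh start doesn't reprocess everything.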
4. Time-Based Partitioning (If Applicable)
- If source files are partitioned by date, you can point Autoloader at just the folder(s) for the day or week you want to reprocess.
- For example, loading only dbfs:/mnt/my-data/2025/10/24/ ingests just that day's data.
In short: Autoloader's checkpointing has no direct "start from a specific timestamp" option. The workarounds rely on file-name navigation (startAfter), record-level filtering in the DataFrame, or date-partitioned source paths. Always test in a non-production environment before making any checkpoint adjustments.