Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader Data Reprocess

AvneeshSingh
New Contributor

Hi,

If possible, can anyone please help me with some Autoloader options? I have two open queries:

(i) Let's assume I am running an Autoloader stream and my job fails. Instead of resetting the whole checkpoint, I want to restart the stream from a specified timestamp (for example, the last day or last week). How can I do that?

(ii) Also, when my code fails, my Autoloader stream is not picking up data from the last failed batch. Is there any possible reason for this, or do I need to change some configuration?

2 REPLIES

AbhaySingh
Databricks Employee

Have you already reviewed the following doc? Please let me know the specifics of what you're seeing and we can go from there, but I'd start with this:

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options

 

mark_ott
Databricks Employee

In Databricks Autoloader, controlling the starting point for streaming data after a job failure requires careful management of checkpoints and configuration options. By default, Autoloader uses checkpoints to remember where the stream last left off, so you don't miss or reprocess data. However, if you want to start the stream from a specific timestamp, day, or week rather than resetting the whole checkpoint, here are your possible approaches:
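
For context, a minimal Autoloader stream with an explicit checkpoint location looks roughly like the sketch below; the paths and table name here are placeholders, not values from the original post.

python
# Minimal Autoloader pattern: the checkpoint directory is what records
# which files have already been ingested (paths and table name are placeholders).
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("dbfs:/mnt/my-data/")

(df.writeStream
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/my-stream/")
   .trigger(availableNow=True)
   .toTable("my_catalog.my_schema.my_table"))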

1. Using cloudFiles.startAfter Option

  • The cloudFiles.startAfter option lets you tell Autoloader to start ingesting new files whose names are lexicographically after the specified file name.

  • This isn't based on timestamps, but if your source files are named with timestamps or dates, you can leverage this.

Example:

python
spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.startAfter", "20251024") \
    .load("dbfs:/mnt/my-data/")

This starts reading files that come after "20251024" in lexicographical order.

2. Filtering Data by Timestamp Column

  • If your data files include a timestamp field, you can add a .where() filter in your streaming DataFrame to process only records after a certain instant.

Example:

python
from pyspark.sql.functions import col

df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("dbfs:/mnt/my-data/") \
    .where(col("event_time") >= "2025-10-24T00:00:00Z")

This reads ALL data, but only processes records after a certain timestamp.
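
If you use this approach, the filtered stream is still written out with its own checkpoint. A minimal sketch, assuming a placeholder target table and checkpoint path (not from the thread):

python
# Write the filtered DataFrame from the example above to a Delta table.
# A fresh checkpoint keeps this run independent of the failed one;
# the table name and checkpoint path are placeholders.
(df.writeStream
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/my-stream-filtered/")
   .trigger(availableNow=True)
   .toTable("my_catalog.my_schema.my_table"))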

3. Manually Manipulating Checkpoints

  • Generally not recommended, but if you intentionally delete the old checkpoint and restart your stream with the filtering or startAfter options above, you can emulate starting from a certain point (a sketch of this reset pattern follows after this list).

  • Caution: Deleting or editing checkpoint files can cause data duplication if you are not careful.
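
A minimal sketch of that reset, assuming placeholder paths and that the downstream table can tolerate (or deduplicate) reprocessed records; this is a workaround, not an official procedure:

python
from pyspark.sql.functions import col

# WARNING: removing the checkpoint discards Autoloader's record of which
# files were already ingested, so combine it with startAfter or a timestamp
# filter (as above) to limit how much gets reprocessed. Paths are placeholders.
checkpoint_path = "dbfs:/mnt/checkpoints/my-stream/"
dbutils.fs.rm(checkpoint_path, True)  # recursively delete the old checkpoint

df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("dbfs:/mnt/my-data/") \
    .where(col("event_time") >= "2025-10-24T00:00:00Z")

(df.writeStream
   .option("checkpointLocation", checkpoint_path)
   .trigger(availableNow=True)
   .toTable("my_catalog.my_schema.my_table"))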

4. Time-Based Partitioning (If Applicable)

  • If source files are partitioned by date, you can point Autoloader to just the folder(s) for the day or week you want to reprocess.

  • For example, loading only dbfs:/mnt/my-data/2025/10/24/ will ingest just that day's data (see the sketch after this list).
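
A quick sketch of that pattern, assuming the files land in a year/month/day folder layout; the layout, paths, and table name are illustrative assumptions:

python
# Backfill a single day by pointing the load path at that day's folder.
# A separate checkpoint keeps this backfill independent of the main stream.
df = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("dbfs:/mnt/my-data/2025/10/24/")

(df.writeStream
   .option("checkpointLocation", "dbfs:/mnt/checkpoints/backfill-2025-10-24/")
   .trigger(availableNow=True)
   .toTable("my_catalog.my_schema.my_table"))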


There is no direct "start from specific timestamp" option in Autoloader's checkpointing. Workarounds rely on file-based navigation (startAfter) or record filtering in the DataFrame. Always test in non-production before making checkpoint adjustments.
