In Databricks Autoloader, controlling where a stream resumes after a job failure comes down to managing checkpoints and configuration options. By default, Autoloader uses checkpoints to remember where the stream last left off, so you neither miss nor reprocess data. If you instead want the stream to start from a specific timestamp, day, or week rather than resetting the whole checkpoint, here are the main approaches:
1. Using cloudFiles.startAfter Option
- The cloudFiles.startAfter option tells Autoloader to start ingesting only files whose names sort lexicographically after the specified file name.
- This isn't based on timestamps, but if your source files are named with dates or timestamps, you can leverage it.
Example:

```python
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.startAfter", "20251024")  # a file name, compared lexicographically
      .load("dbfs:/mnt/my-data/"))
```

This starts reading files whose names come after "20251024" in lexicographical order.
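Because the comparison is lexicographic rather than chronological, this only behaves intuitively when file names use a fixed-width, zero-padded date format. A quick sanity check in plain Python (no Spark needed; the name strings are made up for illustration):

```python
# Lexicographic string comparison, as used when ordering file names.
# Zero-padded, fixed-width date prefixes sort in chronological order:
assert "20251023" < "20251024" < "20251101"

# Unpadded components break the ordering: '9' sorts after '1',
# so September lands *after* October here.
assert "2025-9-01" > "2025-10-01"  # lexicographic, not chronological!
```

So a startAfter value like "20251024" is only reliable if every file in the source path follows the same zero-padded naming convention.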
2. Filtering Data by Timestamp Column
- If your data files include a timestamp field, you can add a .where() filter to the streaming DataFrame so that only records at or after a given instant are processed.
Example:

```python
from pyspark.sql.functions import col

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("dbfs:/mnt/my-data/")
      .where(col("event_time") >= "2025-10-24T00:00:00Z"))
```

Note that this still reads ALL files from the source; the filter only drops records older than the cutoff before they reach downstream processing.
3. Manually Manipulating Checkpoints
- Generally not recommended, but if you intentionally delete the old checkpoint and restart the stream with the filtering or startAfter approaches above, you can emulate starting from a chosen point.
- Caution: deleting or editing checkpoint files can cause data duplication if you are not careful.
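If you do reset a checkpoint, archiving the directory is safer than deleting it outright, since you can roll back if the restart goes wrong. A minimal sketch in plain Python (the paths are hypothetical; for DBFS locations you would adapt this to dbutils.fs operations):

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def archive_checkpoint(checkpoint_dir: str) -> str:
    """Move an existing checkpoint directory aside instead of deleting it,
    so the stream starts fresh but the old state can still be restored."""
    src = Path(checkpoint_dir)
    if not src.exists():
        return ""  # nothing to archive; the stream will start fresh anyway
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dst = src.with_name(f"{src.name}_backup_{stamp}")
    shutil.move(str(src), str(dst))
    return str(dst)
```

After archiving, restart the same query pointing at the (now empty) checkpoint location, combined with a timestamp filter or startAfter so the fresh start doesn't reprocess everything.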
4. Time-Based Partitioning (If Applicable)
- If source files are partitioned by date, you can point Autoloader at just the folder(s) for the day or week you want to reprocess.
- For example, loading only dbfs:/mnt/my-data/2025/10/24/ ingests just that day's data.
In short: Autoloader's checkpointing has no direct "start from a specific timestamp" option. The workarounds rely on file-name navigation (startAfter), record-level filtering in the DataFrame, or date-partitioned source paths. Always test in a non-production environment before making any checkpoint adjustments.