09-01-2024 11:56 PM - edited 09-01-2024 11:58 PM
Hello Databricks Community,
I'm encountering an issue with the Databricks Autoloader where, after running successfully for a period of time, it suddenly stops detecting new files in the source directory. This issue only gets resolved when I reset the checkpoint, which forces Autoloader to reprocess all files from scratch. This behavior is unexpected and has disrupted our data pipeline operations. I'm seeking help to understand and resolve this issue.
Environment Details:
Autoloader Configuration:
Code Setup:
sdf = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.allowOverwrites", "true")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("badRecordsPath", bad_records_path)
    .schema(schema)
    .load(loading_path))
(sdf.writeStream
    .format("delta")
    .outputMode("append")
    .option("mergeSchema", "true")
    .option("badRecordsPath", bad_records_path)
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", checkpoint_path)
    .start(table_path))
Problem Description:
Steps Taken for Troubleshooting:
Additional Context:
Questions:
Any insights, suggestions, or similar experiences would be greatly appreciated!
Thank you!
09-02-2024 04:22 AM - edited 09-02-2024 04:23 AM
@boitumelodikoko That's a weird issue; however, there are two things I would check first:
- cloudFiles.maxFileAge: if it's set to None, that's fine. Any other value could cause an issue (https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#max-file-a...)
- cloudFiles.backfillInterval: it's worth setting this to at least once a week (https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#trigger-re...); see the example sketch below.
I would also check the bad_records_path directory; maybe files end up in there due to schema inference.
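To make that concrete, here is a minimal sketch of how those two options could be added to the readStream from the original post; the "14 days" and "1 week" values are illustrative placeholders, not recommendations:
# Sketch only: same Auto Loader read as above, with maxFileAge and backfillInterval added.
sdf = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.allowOverwrites", "true")
    .option("cloudFiles.inferColumnTypes", "true")
    # Leave maxFileAge unset unless you really need it; a too-small value can
    # silently skip files older than the threshold (illustrative value only):
    # .option("cloudFiles.maxFileAge", "14 days")
    # Periodically re-list the source to catch files missed by incremental discovery:
    .option("cloudFiles.backfillInterval", "1 week")
    .option("badRecordsPath", bad_records_path)
    .schema(schema)
    .load(loading_path))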
09-02-2024 05:12 AM
Hi @daniel_sahal ,
Thanks for the response:
09-02-2024 06:10 AM
@boitumelodikoko Yes, setting backfillInterval should still be worth it even without file notification mode.
Schema inference can also behave differently on a full reload vs. an incremental load. Imagine JSON files containing a complex data type that is constantly changing: during the initial load, Auto Loader takes a sample of 1k files and merges their schemas, while an incremental load could end up with a slightly different result.
One thing that came to mind is the lack of .awaitTermination() at the end of the write. Without it, a failure might not surface, so you could think your code completed without errors.
https://api-docs.databricks.com/python/pyspark/latest/pyspark.ss/api/pyspark.sql.streaming.Streaming...
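As an illustration, here is a minimal sketch of keeping a handle on the query and blocking on it, reusing the writeStream from the original post (since foreachBatch handles the Delta upsert, the format and path are omitted here):
# Sketch: keep the StreamingQuery handle and block on it so streaming failures
# are re-raised on the driver instead of passing silently.
query = (sdf.writeStream
    .outputMode("append")
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", checkpoint_path)
    .start())

# Without this, the job can look successful even if the stream failed:
query.awaitTermination()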
09-02-2024 06:55 AM
@daniel_sahal,
Any suggestions on a backfillInterval value that would work best for my use case?
I will also look into .awaitTermination() and test that out.
I am mainly looking for a solution that will make our ingestion more robust.
08-12-2025 12:19 PM
We've run into this issue as well with DLT (Autoloader under the hood). Unfortunately, I don't know if there is a definitive answer out on this yet. The closest thing I got on my post is that: "Autoloader, in general, is highly recommended for ingestion where the files are immutable."
Didn't love that answer to be honest, after all I would expect that if the feature is available it should work as advertised.
Is this still an outstanding issue? Have you had any luck getting this resolved? I'd be curious to get a better understanding of the root cause of this.
08-12-2025 09:48 PM
I have found that reducing the number of objects in the landing path (via an archive/cleanup process) is the most reliable fix. Auto Loader's file discovery can bog down in large, long-lived landing folders, especially in directory-listing mode, so cleaning or moving processed files keeps discovery fast and avoids the odd "no new files" stalls. Databricks now exposes this as a first-class knob: cloudFiles.cleanSource with either MOVE (to an archive path) or DELETE.
Here's a concise root-cause + hardening checklist you can keep:
sdf = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.allowOverwrites", "true")
    # Auto-archive processed files:
    .option("cloudFiles.cleanSource", "MOVE")  # or DELETE
    .option("cloudFiles.cleanSource.moveDestination", "abfss://raw@acnt.dfs.core.windows.net/archive/myfeed/")
    .option("cloudFiles.cleanSource.retentionDuration", "30 days")  # default; can be shorter for MOVE
    .option("badRecordsPath", bad_records_path)
    .schema(schema)
    .load(loading_path))
Notes: cleanSource runs cleanup only after the retention period has passed; make sure the service principal has write access on both the landing and archive paths. These options are documented in the current Auto Loader options reference on Microsoft Learn.
If your landing zone is high-churn or very large, wiring up storage events plus a queue (on Azure: Event Grid to a Storage Queue) gives more deterministic discovery and less listing overhead. The Databricks KB article on "fails to pick up new files in directory listing mode" explicitly recommends notification mode, or disabling incremental listing in tricky cases; a sketch of the notification-mode setup follows below.
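For completeness, a minimal sketch of switching the same read to file notification mode; this assumes the workspace identity can create (or has been given) the required Event Grid subscription and queue resources:
# Sketch: same Auto Loader read, using file notification mode instead of
# directory listing, so new files are discovered via storage events.
sdf = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.allowOverwrites", "true")
    # Use cloud storage notifications (Event Grid + Queue on Azure):
    .option("cloudFiles.useNotifications", "true")
    .option("badRecordsPath", bad_records_path)
    .schema(schema)
    .load(loading_path))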
08-13-2025 07:18 PM
This is a phenomenal explanation. Thanks a ton!