Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks Autoloader Checkpoint

boitumelodikoko
Contributor III

Hello Databricks Community,

I'm encountering an issue with the Databricks Autoloader where, after running successfully for a period of time, it suddenly stops detecting new files in the source directory. This issue only gets resolved when I reset the checkpoint, which forces Autoloader to reprocess all files from scratch. This behavior is unexpected and has disrupted our data pipeline operations. I'm seeking help to understand and resolve this issue.

Environment Details:

  • Databricks Runtime Version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
  • Cloud Platform: Azure

Autoloader Configuration:

  • File Format: JSON
  • Directory Structure: Files are placed in a flat directory structure in cloud storage (e.g., AWS S3, Azure Blob Storage).
  • File Arrival: Files are added incrementally and may sometimes be overwritten.

Code Setup:

sdf = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", 'true')
  .option("cloudFiles.inferColumnTypes", "true")
  .option("badRecordsPath", bad_records_path)
  .schema(schema)
  .load(loading_path))

(sdf.writeStream
  .format("delta")
  .outputMode("append")
  .option("mergeSchema", "true")
  .option("badRecordsPath", bad_records_path)
  .foreachBatch(upsert_to_delta)
  .option("checkpointLocation", checkpoint_path)
  .start(table_path))

Problem Description:

  • Issue: Autoloader stops detecting new files after it has been working successfully for some time.
  • Resolution Attempted: Resetting the checkpoint path resolves the issue temporarily, allowing Autoloader to detect and process new files again. However, this approach is not ideal as it forces reprocessing of all files, leading to potential duplication and increased processing time.
  • Expected Behavior: Autoloader should continuously detect and process new files as they are added to the loading_path, without needing to reset the checkpoint.

Steps Taken for Troubleshooting:

  1. Verified that the checkpoint location is consistent and has the correct permissions.
  2. Checked the naming pattern and directory structure of new files.
  3. Ensured no manual changes are made to the checkpoint directory.
  4. Monitored the logs for any specific error messages or warnings related to file detection.
  5. Reviewed and confirmed that there were no significant changes to the data source or the processing logic during the period when the issue began.

Additional Context:

  • Files are sometimes overwritten in the source directory, which is why cloudFiles.allowOverwrites is set to true.
  • Schema changes may occur, so mergeSchema and inferColumnTypes options are enabled to handle schema evolution.
  • Using a custom upsert_to_delta function for handling upserts into Delta tables (a simplified sketch of the pattern follows after this list).
  • The issue occurred unexpectedly, despite the Autoloader working without problems for a considerable amount of time.
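
For context, upsert_to_delta follows the usual foreachBatch merge pattern; a simplified sketch, with a placeholder merge key id (substitute the real business key columns):

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # Merge each micro-batch into the target Delta table.
    # "id" is a placeholder merge key; replace it with the real key column(s).
    target = DeltaTable.forPath(micro_batch_df.sparkSession, table_path)
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())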

Questions:

  1. What could be causing Autoloader to suddenly stop detecting new files after working fine for a while?
  2. Are there any specific best practices or configurations for managing checkpoints and Auto Loader's file discovery mode that might prevent this from happening?
  3. How can I avoid having to reset the checkpoint to detect new files?

Any insights, suggestions, or similar experiences would be greatly appreciated!

Thank you!


Thanks,
Boitumelo

7 REPLIES

daniel_sahal
Esteemed Contributor

@boitumelodikoko That's a weird issue. However, there are two things I would check first:
- cloudFiles.maxFileAge - if it isn't set (None), that's fine. If it is set to another value, that could cause the issue (https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#max-file-a...)

- cloudFiles.backfillInterval - it's worth setting that to at least once a week (https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#trigger-re...); see the sketch below.
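
For reference, a sketch of where those options would sit in the readStream from the original post (the interval values are illustrative only):

sdf = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", "true")
  # Leave maxFileAge unset unless you deliberately want older files ignored:
  # .option("cloudFiles.maxFileAge", "14 days")
  # Periodically re-list the source to catch anything file discovery missed:
  .option("cloudFiles.backfillInterval", "1 week")
  .option("badRecordsPath", bad_records_path)
  .schema(schema)
  .load(loading_path))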

I would also check the bad_records_path directory - maybe files somehow end up in there due to schema inference.

Hi @daniel_sahal ,

Thanks for the response:

  • cloudFiles.maxFileAge - isn't set
  • cloudFiles.backfillInterval - Is it still worth using even though I am not using file notification mode? Also, I have Autoloader running for a live system that runs once a week, and I need 100% of the data during the live stream; what interval would you suggest?
  • The bad_records_path directory is empty. I would suspect a schema issue if I didn't get the data after deleting/resetting the checkpoint.

Thanks,
Boitumelo

@boitumelodikoko Yes, setting backfillInterval should still be worth it even without file notification mode.

Schema inference can also behave differently on a full reload vs. an incremental load. Imagine JSON files with a complex data type that is constantly changing: during the initial load, Auto Loader samples up to 1,000 files and merges their schemas, while an incremental load could end up with a slightly different result.

One other thing that comes to mind is the lack of .awaitTermination() at the end of the write. Without it, a stream failure may not be visible, so you might think your code completed without errors (see the sketch below).
https://api-docs.databricks.com/python/pyspark/latest/pyspark.ss/api/pyspark.sql.streaming.Streaming...
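
A minimal sketch against the writeStream from the original post:

query = (sdf.writeStream
  .foreachBatch(upsert_to_delta)
  .option("checkpointLocation", checkpoint_path)
  .start())

# Block until the stream terminates; exceptions raised inside the stream are
# re-thrown here, so failures surface in the job run instead of passing silently.
query.awaitTermination()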

@daniel_sahal,

Any suggestion on a backfillInterval that could work best for my use case?

I will also look into the .awaitTermination() and test that out.

I am mainly looking for a solution that will increase our ingestion's robustness.


Thanks,
Boitumelo

ChristianRRL
Valued Contributor III

We've run into this issue as well with DLT (Autoloader under the hood). Unfortunately, I don't know if there is a definitive answer out on this yet. The closest thing I got on my post is that: "Autoloader, in general, is highly recommended for ingestion where the files are immutable."

I didn't love that answer, to be honest; after all, if the feature is available, I would expect it to work as advertised.

Is this still an outstanding issue? Have you had any luck getting this resolved? I'd be curious to get a better understanding of the root cause of this.

boitumelodikoko
Contributor III

I have found that reducing the number of objects in the landing path (via an archive/cleanup process) is the most reliable fix. Auto Loader's file discovery can bog down in big, long-lived landing folders, especially in directory-listing mode, so cleaning or moving processed files keeps discovery fast and avoids odd "no new files" stalls. Databricks now exposes this as a first-class knob: cloudFiles.cleanSource with either MOVE (to an archive path) or DELETE.

 

Here's a concise root-cause + hardening checklist you can keep:

  • #1: Too many files in landing → incremental listing stalls. Keep the landing zone small: enable cloudFiles.cleanSource and point cloudFiles.cleanSource.moveDestination to an archive. Doing this manually (as described above) works; the built-in option makes it automatic and consistent. (Requires a newer DBR; see the docs.)
  • Directory-listing vs. file-notification mode. Listing mode is the default and simplest, but it can miss or delay files under certain naming patterns and in large directories. If feasible, switch to file-notification mode (storage events + a queue) or disable incremental listing per the KB guidance for stubborn cases (see the sketch after this list).
  • Overwrites & checkpoints. With cloudFiles.allowOverwrites=true, Auto Loader will re-ingest the latest version of a file, but mixed signals (modification times vs. event times) and state in the RocksDB checkpoint can cause non-intuitive behaviour if files are edited in place. An archive flow reduces these edge cases.
  • General production hardening. Follow the production guidelines (Lakeflow/streaming best practices), keep write permissions on the source and archive paths, avoid filenames starting with _, and monitor the stream state.
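
For the directory-listing point above, a sketch of disabling incremental listing (option per the Auto Loader options reference; the rest mirrors the original stream):

sdf = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", "true")
  # Force a full directory listing on each micro-batch instead of incremental listing;
  # costlier, but it avoids the "new files not picked up" stalls described in the KB.
  .option("cloudFiles.useIncrementalListing", "false")
  .option("badRecordsPath", bad_records_path)
  .schema(schema)
  .load(loading_path))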

Drop-in config (Azure)

sdf = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", "true")
  # Auto-archive processed files:
  .option("cloudFiles.cleanSource", "MOVE")  # or DELETE
  .option("cloudFiles.cleanSource.moveDestination", "abfss://raw@acnt.dfs.core.windows.net/archive/myfeed/")
  .option("cloudFiles.cleanSource.retentionDuration", "30 days")  # default; can be shorter for MOVE
  .option("badRecordsPath", bad_records_path)
  .schema(schema)
  .load(loading_path))

 

Notes: cleanSource runs cleanup only after the retention period; ensure the service principal has write access on both the landing and archive paths. These options are documented in the current Auto Loader options reference.

When to prefer file-notification mode

If your landing zone is high-churn or very large, wiring up storage events + a queue (Azure: Event Grid → Storage Queue) gives more deterministic discovery and less listing overhead. The Databricks KB article on Auto Loader "fails to pick up new files in directory listing mode" explicitly recommends notification mode, or disabling incremental listing, in tricky cases.
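
A sketch of switching the same stream to file-notification mode (assumes the workspace identity or service principal can create or access the Event Grid subscription and storage queue; the queue name below is a placeholder, and depending on your setup additional Azure auth options such as cloudFiles.clientId / cloudFiles.clientSecret / cloudFiles.resourceGroup may also be required):

sdf = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")
  # Optionally reuse a queue you provisioned yourself instead of letting
  # Databricks create the Event Grid subscription and queue automatically:
  # .option("cloudFiles.queueName", "<existing-storage-queue>")
  .schema(schema)
  .load(loading_path))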


Thanks,
Boitumelo

This is a phenomenal explanation. Thanks a ton!
