Databricks Community

dikokob · ‎09-01-2024

Hello Databricks Community,

I'm encountering an issue with the Databricks Autoloader where, after running successfully for a period of time, it suddenly stops detecting new files in the source directory. This issue only gets resolved when I reset the checkpoint, which forces Autoloader to reprocess all files from scratch. This behavior is unexpected and has disrupted our data pipeline operations. I'm seeking help to understand and resolve this issue.

Environment Details:

Databricks Runtime Version: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Cloud Platform: Azure

Autoloader Configuration:

File Format: JSON
Directory Structure: Files are placed in a flat directory structure in cloud storage (e.g., AWS S3, Azure Blob Storage).
File Arrival: Files are added incrementally and may sometimes be overwritten.

Code Setup:

sdf = (spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.allowOverwrites", 'true')
  .option("cloudFiles.inferColumnTypes", "true")
  .option("badRecordsPath", bad_records_path)
  .schema(schema)
  .load(loading_path))

(sdf.writeStream
  .format("delta")
  .outputMode("append")
  .option("mergeSchema", "true")
  .option("badRecordsPath", bad_records_path)
  .foreachBatch(upsert_to_delta)
  .option("checkpointLocation", checkpoint_path)
  .start(table_path))

Problem Description:

Issue: Autoloader stops detecting new files after it has been working successfully for some time.
Resolution Attempted: Resetting the checkpoint path resolves the issue temporarily, allowing Autoloader to detect and process new files again. However, this approach is not ideal as it forces reprocessing of all files, leading to potential duplication and increased processing time.
Expected Behavior: Autoloader should continuously detect and process new files as they are added to the loading_path, without needing to reset the checkpoint.

Steps Taken for Troubleshooting:

Verified that the checkpoint location is consistent and has the correct permissions.
Checked the naming pattern and directory structure of new files.
Ensured no manual changes are made to the checkpoint directory.
Monitored the logs for any specific error messages or warnings related to file detection.
Reviewed and confirmed that there were no significant changes to the data source or the processing logic during the period when the issue began.

Additional Context:

Files are sometimes overwritten in the source directory, which is why cloudFiles.allowOverwrites is set to true.
Schema changes may occur, so mergeSchema and inferColumnTypes options are enabled to handle schema evolution.
Using a custom upsert_to_delta function for handling upserts into Delta tables.
The issue occurred unexpectedly, despite the Autoloader working without problems for a considerable amount of time.

Questions:

What could be causing Autoloader to suddenly stop detecting new files after working fine for a while?
Are there any specific best practices or configurations for managing checkpoints and file detection with cloudFiles.mode that might prevent this from happening?
How can I avoid having to reset the checkpoint to detect new files?

Any insights, suggestions, or similar experiences would be greatly appreciated!

Thank you!

daniel_sahal · ‎09-02-2024

@dikokob That's a weird issue however, there are two things that I would check in the first place:
- cloudFiles.maxFileAge, if set to None, that's fine. If it's other value - that could cause an issue (https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#max-file-a...)

- cloudFiles.backfillInterval - it's worth trying setting that to at least once a week (https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html#trigger-re...)

I would also check the bad_records_path directory - maybe somehow files and up in there due to schema inference.

dikokob · ‎09-02-2024

Hi @daniel_sahal ,

Thanks for the response:

cloudFiles.maxFileAge - isn't set
cloudFiles.backfillInterval - Will it still be worth using it even though I am not using file notification mode. Also, I have autoloader running for a live system which runs once a week and I need 100% of the data during the live streaming, what interval would you suggest?
The bad_records_path directory is empty. I would suspect a schema issue if I didn't get the data after deleting/reseting the checkpoint.

daniel_sahal · ‎09-02-2024

@dikokob Yes, setting backfillinterval should be still worth it even without file notification mode.

The schema inference issue could be handled differently when running full reload vs. incremental load. Imagine a situation with JSON files, when you've got a complex data type that is constantly changing - during the initial load, it would take a sample of 1k files and merge their schema. When doing incremental load it could end up a little bit differently.

One thing that came into my mind is lack of .awaitTermination() at the end of the write. This could cause a failure to not be visible thus you might think that your code completed without failures.
https://api-docs.databricks.com/python/pyspark/latest/pyspark.ss/api/pyspark.sql.streaming.Streaming...

dikokob · ‎09-02-2024

@daniel_sahal,

Any suggestion on a backfillinterval, that could work best for my use case?

I will also look into the .awaitTermination() and test that out.

I am mainly looking for a solution that will increase our ingestion's robustness.

IslaCarr · ‎10-07-2024

Have you found something?

I am waiting for your reply and till then I will search for reviews online for academized.com website. Because my brother wants to hire an essay writer from the given website and that is why I want to read reviews.

Databricks Community

Databricks Autoloader Checkpoint

Connect with Databricks Users in Your Area

Join Us as a Community Technical Moderator

Databricks Community Champion - October 2024 - Filip Niziol

Become Our Next Monthly Community Champion!

Introducing Simple, Fast, and Scalable Batch LLM Inference on Mosaic AI Model Serving

Databricks Migration Strategy: Lessons Learned