Hi @harvey-c, Certainly! Databricks Auto Loader provides several configuration options for efficiently ingesting data from an S3 bucket or directory.
Let’s focus on the relevant options for listing only new files:
cloudFiles.allowOverwrites: This boolean option controls whether changes to files in the input directory are allowed to overwrite existing data, i.e., whether a file that is modified after ingestion gets processed again. It’s available in Databricks Runtime 7.6 and above. By default, it’s set to false.
cloudFiles.backfillInterval: In asynchronous backfill mode, Auto Loader triggers backfills at a specified interval (e.g., once a day or once a week). Backfills help ensure that all files eventually get processed, since file event notification systems don’t guarantee 100% delivery of uploaded files. Available in Databricks Runtime 8.4 (unsupported) and above.
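For instance, here’s a minimal PySpark sketch of a file-notification stream with a daily backfill sweep; the bucket and schema-location paths are hypothetical placeholders:

```python
# `spark` is predefined in Databricks notebooks; outside one you would
# first build a session via SparkSession.builder.getOrCreate().
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # File notification mode; the backfill sweep catches any events
    # the notification service may have missed.
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.backfillInterval", "1 day")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")  # hypothetical
    .load("s3://my-bucket/landing/")  # hypothetical
)
```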
cloudFiles.includeExistingFiles: This boolean option determines whether to include files already present in the input path when the stream starts, or to process only files that arrive afterwards. It’s evaluated only when you start a stream for the first time and has no effect after restarting the stream. By default, it’s set to true, so set it to false to list only new files, as in the sketch below.
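A minimal sketch of that only-new-files setup (paths again hypothetical):

```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Skip files already in the bucket; evaluated only on the first run,
    # so changing it after a restart has no effect.
    .option("cloudFiles.includeExistingFiles", "false")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")  # hypothetical
    .load("s3://my-bucket/landing/")  # hypothetical
)
```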
cloudFiles.inferColumnTypes: When using schema inference, this boolean option controls whether exact column types are inferred. By default, columns are inferred as strings for JSON and CSV datasets. Set it to true if you want precise column type inference.
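For example, with a CSV source (paths hypothetical; `header` is a standard CSV reader option):

```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")  # first line of each file holds column names
    # Infer ints, doubles, timestamps, etc. instead of all-string columns.
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")  # hypothetical
    .load("s3://my-bucket/landing/")  # hypothetical
)
```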
cloudFiles.maxBytesPerTrigger: Specifies the maximum number of new bytes to process in each trigger. For example, you can limit each micro-batch to 10 GB of data. This is a soft maximum: when combined with cloudFiles.maxFilesPerTrigger, Databricks consumes up to whichever of the two limits is reached first.
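For example, to cap each micro-batch at roughly 10 GB or 1,000 files, whichever is hit first (paths hypothetical):

```python
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxBytesPerTrigger", "10g")    # soft byte cap per micro-batch
    .option("cloudFiles.maxFilesPerTrigger", "1000")   # soft file cap per micro-batch
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/")  # hypothetical
    .load("s3://my-bucket/landing/")  # hypothetical
)
```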
Remember that Auto Loader’s directory listing mode allows you to quickly start streams without additional permission configurations beyond access to your data on cloud storage.
Happy data ingestion! 🚀
For more details, you can refer to the official Databricks documentation.