I'm using Databricks Auto Loader to incrementally load a series of CSV files on S3, which I update via an API. My typical workflow is to update only the current year's file each night. But on occasion, previous years' files also get updated when records from those years change. In that case, I overwrite the CSV file for that year.
I'm following this guide:
Is there a way to get Auto Loader to pick up those previous-year updates? When I add a new file (a new year), Auto Loader ingests it into my lake as expected. But when a previously ingested file is updated under the same file name, it does not appear to. My assumption is that Auto Loader considers the file already ingested (via its filename-based tracking) and therefore ignores it.
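For context, here is roughly how my ingest is set up. The bucket, schema location, and checkpoint paths below are placeholders, not my real values, and I've sketched the options as a plain dict for readability:

```python
# Rough sketch of my Auto Loader ingest (placeholder paths throughout).
# This runs in the default directory-listing mode, where Auto Loader
# tracks which file names it has already seen -- which is presumably
# why a rewritten file with the same name gets skipped.
autoloader_options = {
    "cloudFiles.format": "csv",
    "cloudFiles.schemaLocation": "s3://my-bucket/_schemas/yearly",  # placeholder
    "header": "true",
}

# On the cluster this is wired up along these lines:
#
#   df = (spark.readStream
#           .format("cloudFiles")
#           .options(**autoloader_options)
#           .load("s3://my-bucket/yearly/"))        # one CSV per year
#   (df.writeStream
#      .option("checkpointLocation", "s3://my-bucket/_checkpoints/yearly")
#      .toTable("my_lake.yearly"))                  # placeholder table name
```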
Is there a way to trigger incremental ingestion via "update date" or some other method?
I'm considering starting down the path of file notification services (SQS/SNS) to trigger incremental file ingestion.
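If I do go the notification route, my understanding (possibly wrong) is that it's mostly a configuration switch rather than wiring up SQS/SNS by hand, something like the following. The region value is a placeholder:

```python
# Hypothetical notification-mode settings. My understanding is that with
# cloudFiles.useNotifications enabled, Auto Loader sets up the SQS/SNS
# resources itself rather than listing the directory on each run.
notification_options = {
    "cloudFiles.format": "csv",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.region": "us-east-1",  # placeholder region
}

# Passed the same way as before:
#   spark.readStream.format("cloudFiles").options(**notification_options).load(...)
```

What I can't tell from the docs is whether notification mode would actually re-ingest an overwritten file, or whether it has the same already-seen-this-name behavior as directory listing.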
Any help on which path to take would be appreciated.