Hello,
I am trying to ingest CSV data with Auto Loader from an Azure Data Lake. I want to perform batch ingestion by using a scheduled job and the following trigger:
.trigger(availableNow=True)
The CSV files are generated by Azure Synapse Link. When a table in Microsoft Dataverse changes and more than five minutes have passed since its last recorded change, a new CSV file recording the changes is written to the data lake. For the next five minutes, further changes to the table can still be added as rows to that same CSV file.
If Auto Loader is triggered right after a new CSV file is created, it could miss changes that are written to the (by then already ingested) CSV file afterwards.
A solution that I thought would work was to use Auto Loader with the modifiedBefore option and specify a timestamp of datetime.utcnow() - timedelta(minutes=5).
This seemed to work at first: a file that isn't older than five minutes is successfully ignored. However, when Auto Loader is run again later, it doesn't ingest the CSV file that was previously ignored. It seems that Auto Loader registers the CSV file (which was less than 5 minutes old during the initial run) and therefore doesn't ingest the file during the second run when the CSV file is now older than 5 minutes.
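For reference, here is a minimal sketch of the setup I am describing. The function names, paths and table name are placeholders for illustration rather than my exact code, and it assumes the Spark session time zone is UTC so that the cutoff string is interpreted as intended.

```python
from datetime import datetime, timedelta, timezone


def cutoff_timestamp(now=None, quiet_minutes=5):
    """Return a UTC cutoff formatted as YYYY-MM-DDTHH:mm:ss.

    Files modified at or after this time should be skipped by modifiedBefore.
    """
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(minutes=quiet_minutes)).strftime("%Y-%m-%dT%H:%M:%S")


def build_reader_options(cutoff, schema_location):
    """Collect the Auto Loader reader options used for this ingestion."""
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": schema_location,  # placeholder location
        # Assumption: the string is read in the session time zone (UTC here).
        "modifiedBefore": cutoff,
    }


def run_ingest(spark, source_path, checkpoint_path, target_table):
    """Start one availableNow run; meant to be called from the scheduled job."""
    options = build_reader_options(cutoff_timestamp(), checkpoint_path)
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream.option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

Each scheduled run calls run_ingest with the same checkpoint path, so only the modifiedBefore cutoff changes between runs.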
Is this the intended behavior of the modifiedBefore option, or are my observations wrong? If it is intended, is there a simple workaround, for example another Auto Loader option I could set?
Thanks for any help with this,
Gil