Data Engineering

Auto Loader ignores data with modifiedBefore

gilt
New Contributor

Hello, 

I am trying to ingest CSV data with Auto Loader from an Azure Data Lake. I want to perform batch ingestion by using a scheduled job and the following trigger:

.trigger(availableNow=True)

The CSV files are generated by Azure Synapse Link. If more than five minutes have passed since the last recorded change to a table in Microsoft Dataverse, a new CSV file gets written to the data lake recording the changes made. In the following five minutes, the new CSV file can still get rows inserted into it if a change to the table is made.

If Auto Loader gets triggered right after a new CSV file is created, it could potentially miss out on changes that will be written to the (now already ingested) CSV file.

A solution that I thought would work was to use Auto Loader with the modifiedBefore option and specify a timestamp of datetime.utcnow() - timedelta(minutes=5).

This seemed to work at first: a file that isn't older than five minutes is successfully ignored. However, when Auto Loader is run again later, it doesn't ingest the CSV file that was previously ignored. It seems that Auto Loader registers the CSV file (which was less than 5 minutes old during the initial run) and therefore doesn't ingest the file during the second run when the CSV file is now older than 5 minutes. 
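For reference, the setup described above can be sketched roughly as follows. This is only an illustration of the configuration in question, not a fix for the behavior. The function names, the path and table parameters, and the choice to pass the Spark session in as an argument are all assumptions made for the sketch. The timestamp string follows the YYYY-MM-DDTHH:mm:ss form that Spark's file source options expect, and Spark interprets it in the session time zone unless configured otherwise.

```python
from datetime import datetime, timedelta, timezone


def modified_before_cutoff(now=None, quiet_minutes=5):
    """Return a timestamp string `quiet_minutes` before `now` (UTC by default)."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(minutes=quiet_minutes)
    # Spark's file source options expect the form YYYY-MM-DDTHH:mm:ss.
    return cutoff.strftime("%Y-%m-%dT%H:%M:%S")


def build_reader_options(cutoff):
    """Auto Loader reader options for CSV files modified before `cutoff`."""
    return {
        "cloudFiles.format": "csv",
        "modifiedBefore": cutoff,
    }


def run_ingest(spark, source_path, schema_path, checkpoint_path, target_table):
    """Start one availableNow run; all paths and names are hypothetical."""
    options = build_reader_options(modified_before_cutoff())
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .option("cloudFiles.schemaLocation", schema_path)
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .toTable(target_table)
    )
```

Because the cutoff is recomputed on every scheduled run, each run passes a different modifiedBefore value to the same checkpoint.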

Is this the intended behavior of the use of the modifiedBefore option? Or are my observations wrong? If this is the intended behavior, are there any simple workarounds achievable by setting another option in Auto Loader?

Thanks for any help with this,

Gil

1 REPLY

Brahmareddy
Valued Contributor III

Hi @gilt,

How are you doing today?

As I understand it, the modifiedBefore option seems to mark the file as processed during the first trigger, even if it's incomplete, so consider adjusting the Auto Loader configuration. This behavior might be expected because Auto Loader registers the file in its metadata. One potential solution is to delay the ingestion job so that each CSV file has enough time to be fully written. Alternatively, you could experiment with the cloudFiles.allowOverwrites option, which lets Auto Loader reprocess a file that changes after it was first ingested; the ignoreChanges option applies to Delta table streaming sources rather than to Auto Loader. Reprocessing reads the whole file again, so you may need to deduplicate rows downstream. Also, consider using a watermarking strategy or checking that a file's size has stopped changing before ingestion to avoid partial file ingestion.
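The file size stability check mentioned above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not an Auto Loader feature: list_files is a hypothetical callable that returns directory entries with path and size attributes (on Databricks it might wrap dbutils.fs.ls on the source directory), and the function names are made up for the example. You would have to pair it with your own ingestion logic and bookkeeping.

```python
import time


def snapshot_sizes(entries):
    """Map each file path to its size in bytes."""
    return {entry.path: entry.size for entry in entries}


def size_stable_paths(earlier, later):
    """Return, sorted, the paths whose size is identical in both snapshots."""
    return sorted(
        path for path, size in later.items()
        if earlier.get(path) == size  # new files have no earlier size
    )


def find_stable_files(list_files, wait_seconds=60):
    """Take two listings `wait_seconds` apart and keep the unchanged files.

    `list_files` is a hypothetical zero-argument callable returning entries
    that expose `.path` and `.size`.
    """
    earlier = snapshot_sizes(list_files())
    time.sleep(wait_seconds)
    later = snapshot_sizes(list_files())
    return size_stable_paths(earlier, later)
```

An unchanged size over a short interval does not prove that a file is finished. Since the files here can receive rows for up to five minutes, the wait would need to cover that window to be meaningful.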

Give it a try and let me know.

Regards,

Brahma
