
Auto Loader File Notification Mode not working with ADLS Gen2 and files written as a stream

rvo19941
New Contributor II

Dear,

I am working on a real-time use case and am therefore using Auto Loader with file notification mode to ingest JSON files from an Azure Data Lake Gen2 storage account in real time. Full refreshes of my table work fine, but I noticed Auto Loader was not picking up new files landing in the storage account. I checked the storage queue and it stays empty. However, when I manually add a file, a message is added to the queue and the file is processed as expected.
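For reference, my stream looks roughly like this (simplified; paths, table names and the service principal details are placeholders):

```python
# Simplified version of my Auto Loader stream (placeholder paths and secrets)
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    .option("cloudFiles.subscriptionId", "<subscription-id>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<sp-client-id>")
    .option("cloudFiles.clientSecret", dbutils.secrets.get("<scope>", "<key>"))
    .option("cloudFiles.schemaLocation", "abfss://<container>@<account>.dfs.core.windows.net/_schemas/source")
    .load("abfss://<container>@<account>.dfs.core.windows.net/landing/")
)

(
    df.writeStream
    .option("checkpointLocation", "abfss://<container>@<account>.dfs.core.windows.net/_checkpoints/source")
    .toTable("bronze.source_events")
)
```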

After some digging I found out that the external system writing the files to the storage account was writing them as a stream (when I inspect the properties of the files written by the external system, I see "application/octet-stream" as the Content-Type, whereas when I manually add a file I see "application/json"). This event type is not matched by default by the event subscription created by Databricks.

I tried to add it to the advanced filters of the event subscription (with the key/value pair data.api: CreateFile). This generates messages in the queue, but because the Microsoft.Storage.BlobCreated event is triggered when an operation such as CopyBlob or CreateFile is initiated, not when it completes, and the CreateFile API call first creates the file and only then appends its content, the contentLength parameter of the corresponding message in the queue is set to 0. Auto Loader therefore considers the file to be empty, even though it is not.

Is there a solution/work-around or is this a limitation of file notification? Thanks in advance!

2 REPLIES

Panda
Valued Contributor

@rvo19941 - Can you share your Auto Loader config?

mark_ott
Databricks Employee

Auto Loader file notification in Databricks relies on Azure Event Grid’s BlobCreated event to trigger notifications for newly created files in Azure Data Lake Gen2. The issue you’re experiencing is a known limitation when files are written via certain methods—such as streamed writes or the Create File API—especially when they use Content-Type application/octet-stream and trigger creation events before the file is fully committed.

Issue Explanation

  • When files are written with the Create File API or via streaming, the BlobCreated event is triggered as soon as the file is initiated, not when it is completely written.

  • As a result, the corresponding Event Grid message may have contentLength = 0, so Auto Loader sees the file as empty and ignores it (see the illustrative message sketch after this list).

  • When files are uploaded manually (e.g., via Azure Portal/Storage Explorer), the event fires after the file is fully committed, the content type is often set to application/json, and the file is ingested correctly.
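For illustration, the relevant fields of such a notification might look roughly like this (abridged and hypothetical; the real Microsoft.Storage.BlobCreated payload contains more fields):

```python
# Abridged, hypothetical BlobCreated notification for a streamed write;
# only the fields relevant to this issue are shown.
event = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "subject": "/blobServices/default/containers/landing/blobs/device-42.json",
    "data": {
        "api": "CreateFile",                       # ADLS Gen2 create call, fired before content is appended
        "contentType": "application/octet-stream",
        "contentLength": 0,                        # nothing flushed yet -> Auto Loader treats the file as empty
        "blobType": "BlockBlob",
        "url": "https://<account>.blob.core.windows.net/landing/device-42.json",
    },
}
```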

Workarounds and Solutions

1. Poll Mode Instead of File Notification

In directory listing (poll) mode, Auto Loader periodically scans the input path and picks up files that have finished writing, regardless of the initial event trigger or content type. This is less real-time but more robust against such file-commit timing issues.
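A minimal sketch of the switch, assuming the rest of the stream stays as in a typical setup (paths are placeholders):

```python
# Directory listing (poll) mode: set useNotifications to false (or omit it; listing is the default).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "false")  # scan the directory instead of relying on Event Grid
    .option("cloudFiles.schemaLocation", "abfss://<container>@<account>.dfs.core.windows.net/_schemas/source")
    .load("abfss://<container>@<account>.dfs.core.windows.net/landing/")
)
```

Latency is then governed by the stream's trigger interval rather than by queue notifications.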

2. Change External System's Write Method

  • If possible, update the external system to upload files in a single operation (and, ideally, set the appropriate content type, application/json), ensuring that BlobCreated events only fire after the full file is committed (a sketch of such an upload follows this list).

  • Alternatively, the system could upload to a temporary location, then move the fully written file into the target directory when complete.
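If the external system is Python-based, a single-operation upload with an explicit content type could look roughly like this (a sketch, assuming the azure-storage-blob SDK; the connection string, container, and naming are placeholders):

```python
from azure.storage.blob import BlobServiceClient, ContentSettings

# Hypothetical helper: upload the finished file in one call so BlobCreated
# fires with the final content length and the desired content type.
def upload_json(local_path: str, blob_name: str) -> None:
    service = BlobServiceClient.from_connection_string("<connection-string>")
    blob = service.get_blob_client(container="landing", blob=blob_name)
    with open(local_path, "rb") as f:
        blob.upload_blob(
            f,
            overwrite=True,
            content_settings=ContentSettings(content_type="application/json"),
        )
```

Because the whole file is committed in one Blob API upload, the resulting BlobCreated event carries the final content length, so it should be matched by the event subscription Databricks creates by default.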

3. Event Subscription Advanced Filters

  • Your workaround to filter on additional event details (e.g., data.api: CreateFile) helps to catch more events but does not resolve the core issue, since Event Grid may still fire events for empty/partially committed files.

  • No direct configuration on the Event Grid side can guarantee that only fully committed, non-empty files trigger an event.

4. Post-Processing Validation

If notification mode is required, you might need to build a post-processing validation in your pipeline. For example, before ingesting files, validate their size/content to avoid processing empty files created by incomplete writes.
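For example, a small pre-ingestion check could look roughly like this (a hypothetical helper using dbutils.fs.ls; the landing path is a placeholder):

```python
# Hypothetical pre-ingestion check: only consider files whose actual size in
# ADLS is non-zero, regardless of what the original notification reported.
def non_empty_files(landing_dir: str) -> list[str]:
    return [f.path for f in dbutils.fs.ls(landing_dir) if f.size > 0]

ready_paths = non_empty_files("abfss://<container>@<account>.dfs.core.windows.net/landing/")
```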

5. File Locking or Marker Files

Implement a marker file strategy: the external system writes a .tmp file or appends a special suffix, then renames or moves the file once the write is complete. Auto Loader can be configured to process only files without the .tmp suffix or marker.
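On the Auto Loader side, such a convention could be enforced with a glob filter, roughly like this (a sketch assuming in-progress files carry a .tmp suffix; paths are placeholders):

```python
# Only match finished files; in-progress writes use a .tmp suffix and are excluded.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("pathGlobFilter", "*.json")  # ignore *.tmp files still being written
    .option("cloudFiles.schemaLocation", "abfss://<container>@<account>.dfs.core.windows.net/_schemas/source")
    .load("abfss://<container>@<account>.dfs.core.windows.net/landing/")
)
```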

Limitations

This is primarily a limitation of Azure's event generation logic and of how the storage API triggers these events, not of Databricks Auto Loader itself. Some updates to Azure Event Grid and Auto Loader are in progress to improve this scenario, but no immediate fix exists today.