Auto Loader's file notification mode in Databricks relies on the Azure Event Grid BlobCreated event to learn about newly created files in Azure Data Lake Storage Gen2. The issue you're experiencing is a known limitation when files are written by certain methods, such as streamed writes or the Create File API: the creation event can fire before the file is fully committed. In your case these files also arrive with Content-Type application/octet-stream.
Issue Explanation
- When files are written with the Create File API or via streaming, the BlobCreated event is triggered as soon as the file is created, not when it has been completely written.
- As a result, the corresponding Event Grid message may have contentLength = 0, so Auto Loader sees the file as empty and ignores it.
- When files are uploaded manually (e.g., via Azure Portal or Storage Explorer), the event fires after the file is fully committed, the content type is often set to application/json, and the file is ingested correctly.
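To make the difference concrete, here is a minimal sketch that inspects two trimmed BlobCreated payloads. The field names (eventType, data.api, data.contentType, data.contentLength, data.url) follow the Event Grid Blob Storage event schema; the values, URLs, and helper names are invented for illustration and are not taken from your environment.

```python
import json

# Trimmed BlobCreated payloads. Field names follow the Event Grid Blob Storage
# event schema; every value below is made up for illustration.
SAMPLE_EVENTS = json.loads("""
[
  {"eventType": "Microsoft.Storage.BlobCreated",
   "data": {"api": "CreateFile",
            "contentType": "application/octet-stream",
            "contentLength": 0,
            "url": "https://example.dfs.core.windows.net/landing/a.json"}},
  {"eventType": "Microsoft.Storage.BlobCreated",
   "data": {"api": "PutBlob",
            "contentType": "application/json",
            "contentLength": 2048,
            "url": "https://example.blob.core.windows.net/landing/b.json"}}
]
""")


def is_empty_creation(event: dict) -> bool:
    """Return True when a BlobCreated event reports a zero-length file."""
    data = event.get("data", {})
    return (
        event.get("eventType") == "Microsoft.Storage.BlobCreated"
        and int(data.get("contentLength", 0)) == 0
    )


def summarize(events):
    """Return (file name, api, reported-empty) for each event."""
    return [
        (e["data"]["url"].rsplit("/", 1)[-1], e["data"]["api"], is_empty_creation(e))
        for e in events
    ]
```

Running summarize(SAMPLE_EVENTS) flags the CreateFile event as empty and the PutBlob event as non-empty, which mirrors the behaviour described above.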
Workarounds and Solutions
1. Poll Mode Instead of File Notification
Switching Auto Loader to directory listing (polling) mode makes it scan the input path periodically and pick up files that have finished writing, regardless of when the initial event fired or which content type was set. Latency is higher than with notifications, but this mode is more robust against file commit timing issues.
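A minimal sketch of what that switch could look like in PySpark. The cloudFiles.* option names are existing Auto Loader options; the helper names, the JSON format default, and the paths you pass in are assumptions for illustration.

```python
def directory_listing_options(schema_location: str, file_format: str = "json") -> dict:
    """Build Auto Loader options that turn file notifications off."""
    return {
        "cloudFiles.format": file_format,
        # "false" selects directory listing instead of Event Grid notifications.
        "cloudFiles.useNotifications": "false",
        "cloudFiles.schemaLocation": schema_location,
    }


def read_landing_zone(spark, source_path: str, options: dict):
    """Create the streaming DataFrame; `spark` is an active SparkSession."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)
```

On a cluster you would call read_landing_zone(spark, "<your abfss path>", directory_listing_options("<schema path>")); both paths are placeholders.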
2. Change External System's Write Method
- If possible, update the external system to upload each file in a single operation, so that the BlobCreated event fires only after the full file is committed, and to set the appropriate content type (application/json).
- Alternatively, the system could upload to a temporary location, then move the fully written file into the target directory when complete.
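The stage-then-move idea can be sketched with the local filesystem standing in for the storage account. This is only an illustration of the pattern: the function names and directory layout are hypothetical, and on ADLS Gen2 the external system would use its own client's rename or move operation instead of os.replace.

```python
import os
import tempfile
from pathlib import Path


def publish_after_write(payload: bytes, staging_dir: Path, target_dir: Path, name: str) -> Path:
    """Write the full payload into staging, then move it into the watched directory."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    target_dir.mkdir(parents=True, exist_ok=True)
    staged = staging_dir / name
    with open(staged, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the move
    final = target_dir / name
    os.replace(staged, final)  # a single rename; readers never see a partial file
    return final


def demo():
    """Run the pattern in a throwaway directory and report what landed."""
    with tempfile.TemporaryDirectory() as root:
        base = Path(root)
        final = publish_after_write(b'{"id": 1}', base / "_staging", base / "landing", "a.json")
        staged_left = sorted(p.name for p in (base / "_staging").iterdir())
        return final.name, final.read_bytes(), staged_left
```

The watched directory only ever receives complete files, so whichever event the move produces refers to a fully written file.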
3. Event Subscription Advanced Filters
- Your workaround to filter on additional event details (e.g., data.api: CreateFile) helps to catch more events but does not resolve the core issue, since Event Grid may still fire events for empty/partially committed files.
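- No filter on the Event Grid side can by itself guarantee that only fully committed, non-empty files trigger an event. Azure's Blob Storage event documentation does, however, suggest filtering BlobCreated on data.api values that mark a completed write (FlushWithClose for Data Lake Storage Gen2) to avoid events for files that are still being written.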
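For reference, an advanced filter on data.api has roughly the shape below. The operatorType/key/values structure matches Event Grid's advanced filter format; the helper names, the local matching logic (a simplified exact-match stand-in for Event Grid's own evaluation), and the choice of FlushWithClose as the allowed value are assumptions for illustration.

```python
def api_filter(allowed_apis):
    """Build an Event Grid style advanced filter on the data.api field."""
    return {"operatorType": "StringIn", "key": "data.api", "values": list(allowed_apis)}


def passes(event: dict, advanced_filter: dict) -> bool:
    """Simplified local check: does the event's value for `key` appear in `values`?"""
    node = event
    for part in advanced_filter["key"].split("."):
        node = node.get(part, {}) if isinstance(node, dict) else {}
    return isinstance(node, str) and node in advanced_filter["values"]


# Hypothetical events: the first is the early creation event, the second the close.
created = {"data": {"api": "CreateFile", "contentLength": 0}}
closed = {"data": {"api": "FlushWithClose", "contentLength": 1024}}
only_closed = api_filter(["FlushWithClose"])
```

With this filter, passes(created, only_closed) is False and passes(closed, only_closed) is True. A filter like this narrows which events are delivered, but it does not check file contents.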
4. Post-Processing Validation
If notification mode is required, you may need to add a validation step to your pipeline. For example, check each file's size or content before ingesting it, so that empty files left behind by incomplete writes are not processed.
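A minimal sketch of such a size check. It assumes each listing entry exposes path and size attributes, as the entries returned by dbutils.fs.ls do; the FileEntry tuple, the function name, and the sample paths are stand-ins for illustration.

```python
from collections import namedtuple

# Stand-in for a directory listing entry (hypothetical type).
FileEntry = namedtuple("FileEntry", ["path", "size"])


def split_by_size(files, min_bytes: int = 1):
    """Separate files that look complete from files that should be retried later."""
    ready, deferred = [], []
    for f in files:
        (ready if f.size >= min_bytes else deferred).append(f.path)
    return ready, deferred


listing = [
    FileEntry("landing/a.json", 0),
    FileEntry("landing/b.json", 2048),
]
ready, deferred = split_by_size(listing)
```

Here ready contains only landing/b.json, and landing/a.json is deferred until a later run sees a non-zero size.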
5. File Locking or Marker Files
Implement a marker file strategy: the external system writes a .tmp file or appends a special suffix, then renames or moves the file once the write is complete. Auto Loader can be configured to process only files without the .tmp suffix or marker.
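One way to express the "skip .tmp files" rule is the pathGlobFilter reader option, sketched below. The option names are existing ones; the helper names and the *.json pattern are assumptions, and the fnmatch check only approximates how the glob is evaluated on the cluster.

```python
from fnmatch import fnmatchcase


def marker_aware_options(schema_location: str, pattern: str = "*.json") -> dict:
    """Auto Loader options that only admit file names matching `pattern`."""
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "pathGlobFilter": pattern,
    }


def would_match(file_name: str, pattern: str) -> bool:
    """Local approximation of the glob check, handy for testing a pattern."""
    return fnmatchcase(file_name, pattern)
```

With the default pattern, would_match("orders.json.tmp", "*.json") is False and would_match("orders.json", "*.json") is True, so a file becomes visible only after the external system renames it.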
Limitations
This is primarily a limitation of Azure's event generation logic and of how the storage API triggers these events, not of Databricks Auto Loader itself. Some updates to Azure Event Grid and Auto Loader are in progress to improve this scenario, but no immediate fix currently exists.