Notifications have file information but dataframe is empty using autoloader file notification mode
01-17-2025 07:28 AM
Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code that uses a foreachBatch sink to process the data before merging with tables in Delta Lake.
Issue
Notifications are generated for new files, and when the streaming job runs it shows an epoch_id for each batch processed in foreachBatch(), but the dataframe for that same epoch_id is empty.
The file referenced in the notification does contain data, so it is not a case of the source data being empty.
Moreover, if I switch to directory listing mode (the default), it works.
The following file-notification-specific options are set (a sketch of how the stream is wired up follows the list):
options['cloudFiles.includeExistingFiles'] = 'false'
options["cloudFiles.subscriptionId"] = cloudfiles_subscriptionid
options["cloudFiles.tenantId"] = cloudfiles_tenantid
options["cloudFiles.clientId"] = cloudfiles_clientid
options["cloudFiles.clientSecret"] = cloudfiles_clientsecret
options["cloudFiles.resourceGroup"] = cloudfiles_resourcegroup
options["cloudFiles.fetchParallelism"] = 5
options["cloudFiles.resourceTag.streaming_job_autoloader_file_notification_enabled"] = 'true'
options["cloudFiles.resourceTag.streaming_job_autoloader_stream_id"] = 'some_id'
options["cloudFiles.queueName"] = "some_pregenerated_queue"
Labels: Delta Lake, Spark
01-17-2025 10:46 AM
Here are some potential steps and considerations to troubleshoot and resolve the issue:
- Permissions and Configuration:
  - Ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles and permissions for Azure Event Grid and Azure Queue Storage (a minimal queue-access check is sketched after this item). The required roles include:
    - Contributor: for setting up resources in your storage account.
    - Storage Queue Data Contributor: for performing queue operations.
    - EventGrid EventSubscription Contributor: for performing Event Grid subscription operations.
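One quick way to verify that the service principal actually has queue access (i.e., that the Storage Queue Data Contributor assignment is effective) is to peek the notification queue with the same credentials the stream uses. This is only a sketch: the account URL is a placeholder and the credential variables are assumed to be the same ones used for the cloudFiles options.

from azure.identity import ClientSecretCredential
from azure.storage.queue import QueueClient

# Same service principal the stream uses (values taken from the options above).
credential = ClientSecretCredential(
    tenant_id=cloudfiles_tenantid,
    client_id=cloudfiles_clientid,
    client_secret=cloudfiles_clientsecret,
)

# Placeholder account URL; use the storage account that owns the notification queue.
queue = QueueClient(
    account_url="https://<storage-account>.queue.core.windows.net",
    queue_name="some_pregenerated_queue",
    credential=credential,
)

# If this raises an authorization error, the role assignment is missing or has not propagated yet.
for msg in queue.peek_messages(max_messages=5):
    print(msg.content)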
- Event Grid Registration:
  - Verify that Event Grid is registered as a Resource Provider in your Azure subscription. If not, you can register it through the Azure portal under the Resource Providers section (a programmatic check is sketched after this item).
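If you would rather check the registration from code than from the portal, a sketch along these lines (using the azure-mgmt-resource package and the same service-principal variables as above) reports the registration state:

from azure.identity import ClientSecretCredential
from azure.mgmt.resource import ResourceManagementClient

credential = ClientSecretCredential(
    tenant_id=cloudfiles_tenantid,
    client_id=cloudfiles_clientid,
    client_secret=cloudfiles_clientsecret,
)

resource_client = ResourceManagementClient(credential, cloudfiles_subscriptionid)

# registration_state should be "Registered"; "NotRegistered" means Auto Loader
# cannot create the Event Grid subscription needed for file notifications.
provider = resource_client.providers.get("Microsoft.EventGrid")
print(provider.registration_state)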
- File Notification Events:
  - Check if the file notification events are being correctly generated and processed. For ADLS Gen2, Auto Loader listens for the FlushWithClose event for processing a file. Ensure that this event is being triggered correctly (one way to inspect the events is sketched after this item).
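Building on the queue client from the earlier sketch, you can peek the pending Event Grid payloads and look at which storage API produced each event. This is a sketch; messages delivered by Event Grid to a storage queue are typically Base64-encoded JSON, hence the fallback decoding.

import base64
import json

def decode_event(content: str):
    # Event Grid messages in a storage queue are usually Base64-encoded JSON.
    try:
        parsed = json.loads(base64.b64decode(content))
    except Exception:
        parsed = json.loads(content)
    return parsed if isinstance(parsed, list) else [parsed]

for msg in queue.peek_messages(max_messages=10):
    for event in decode_event(msg.content):
        data = event.get("data", {})
        # For ADLS Gen2 uploads Auto Loader expects "FlushWithClose" here; if you only
        # see other api values, the upload path may not be emitting the event it needs.
        print(event.get("eventType"), data.get("api"), event.get("subject"))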
- CloudFiles Options:
  - Review the cloudFiles options you have set. Ensure that all necessary options are correctly configured, including cloudFiles.subscriptionId, cloudFiles.tenantId, cloudFiles.clientId, cloudFiles.clientSecret, cloudFiles.resourceGroup, and cloudFiles.queueName (a small sanity check is sketched after this item).
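A trivial sanity check before starting the stream can catch an option that ended up unset or empty (for example, if a secret lookup silently returned None). This simply reuses the options dict from the question and the keys listed above:

required_keys = [
    "cloudFiles.subscriptionId",
    "cloudFiles.tenantId",
    "cloudFiles.clientId",
    "cloudFiles.clientSecret",
    "cloudFiles.resourceGroup",
    "cloudFiles.queueName",
]

missing = [k for k in required_keys if not options.get(k)]
if missing:
    raise ValueError(f"Missing or empty Auto Loader options: {missing}")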
- Backfill Interval:
  - Consider setting the cloudFiles.backfillInterval option to trigger regular backfills. This can help ensure that all files are discovered within a given SLA if data completeness is a requirement (see the example after this item).
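For example, a daily backfill could be added alongside the existing options; the "1 day" value is only illustrative, so pick an interval that matches your SLA. Backfills also act as a safety net if individual notifications are missed.

# Illustrative interval; choose a value that matches your data-completeness SLA.
options["cloudFiles.backfillInterval"] = "1 day"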

