Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Notifications have file information but dataframe is empty using autoloader file notification mode

Abdul-Mannan
New Contributor III

Using DBR 13.3, I'm ingesting data from one ADLS storage account using Auto Loader with file notification mode enabled, and writing to a container in another ADLS storage account. This is older code that uses a foreachBatch sink to process the data before merging with tables in Delta Lake.

Issue

Notifications are generated for new files, and when the streaming job runs it shows an epoch_id for each batch processed in foreachBatch(), but the dataframe for that same epoch_id is empty.

The file pointed to in the notification does contain data, so it isn't that the source data is empty.

Moreover, if I switch to directory listing mode (the default), it works.

The following file-notification-specific options are set:

# File-notification-mode options passed to the cloudFiles (Auto Loader) source
options["cloudFiles.includeExistingFiles"] = 'false'
# Service principal used for the Event Grid subscription and the queue
options["cloudFiles.subscriptionId"]       = cloudfiles_subscriptionid
options["cloudFiles.tenantId"]             = cloudfiles_tenantid
options["cloudFiles.clientId"]             = cloudfiles_clientid
options["cloudFiles.clientSecret"]         = cloudfiles_clientsecret
options["cloudFiles.resourceGroup"]        = cloudfiles_resourcegroup
options["cloudFiles.fetchParallelism"]     = 5
# Tags applied to the notification resources
options["cloudFiles.resourceTag.streaming_job_autoloader_file_notification_enabled"]  = 'true'
options["cloudFiles.resourceTag.streaming_job_autoloader_stream_id"]  = 'some_id'
# Pre-provisioned queue, so Auto Loader does not create its own notification services
options["cloudFiles.queueName"]            = "some_pregenerated_queue"

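For context, a simplified sketch of how these options are wired into the stream; the source format, the paths, and the merge function below are placeholders rather than the actual job code:

def merge_with_delta(batch_df, epoch_id):
    # The real job merges batch_df into Delta tables here; logging the row count
    # is how the empty dataframe per epoch_id shows up.
    print(f"epoch_id={epoch_id}, rows={batch_df.count()}")

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")            # placeholder source format
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    .options(**{k: str(v) for k, v in options.items()})
    .load("abfss://<container>@<source-account>.dfs.core.windows.net/<path>")
    .writeStream
    .option("checkpointLocation", "abfss://<container>@<target-account>.dfs.core.windows.net/<checkpoints>")
    .foreachBatch(merge_with_delta)
    .start())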

Walter_C
Databricks Employee

Here are some potential steps and considerations to troubleshoot and resolve the issue:

  1. Permissions and Configuration:
    • Ensure that the necessary permissions are correctly set up for file notification mode. This includes having the appropriate roles for Azure Event Grid and Azure Queue Storage. The required roles are:
      • Contributor: for setting up resources in your storage account.
      • Storage Queue Data Contributor: for performing queue operations.
      • EventGrid EventSubscription Contributor: for performing Event Grid subscription operations.
  2. Event Grid Registration:
    • Verify that Event Grid is registered as a resource provider in your Azure subscription. If not, you can register it through the Azure portal under the Resource Providers section. (A programmatic check is included in the verification sketch after this list.)
  3. File Notification Events:
    • Check that the file notification events are being generated and processed. For ADLS Gen2, Auto Loader listens for the FlushWithClose event to process a file, so ensure that this event is being triggered correctly. Peeking at the notification queue (same sketch below) is a quick way to confirm that events are actually reaching it.
  4. CloudFiles Options:
    • Review the cloudFiles options you have set. Ensure that all necessary options are correctly configured, including cloudFiles.subscriptionId, cloudFiles.tenantId, cloudFiles.clientId, cloudFiles.clientSecret, cloudFiles.resourceGroup, and cloudFiles.queueName.
  5. Backfill Interval:
    • Consider setting the cloudFiles.backfillInterval option to trigger regular backfills. This can help ensure that all files are discovered within a given SLA when data completeness is a requirement (see the options snippet at the end of this reply).
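For steps 2 and 3, a minimal verification sketch, assuming the azure-identity, azure-mgmt-resource, and azure-storage-queue packages are available on the cluster, reusing the service principal variables from the question; the storage account URL is a placeholder:

from azure.identity import ClientSecretCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.storage.queue import QueueClient

credential = ClientSecretCredential(cloudfiles_tenantid, cloudfiles_clientid, cloudfiles_clientsecret)

# Step 2: confirm Microsoft.EventGrid is registered in the subscription
resource_client = ResourceManagementClient(credential, cloudfiles_subscriptionid)
print(resource_client.providers.get("Microsoft.EventGrid").registration_state)  # expect "Registered"

# Step 3: peek at the pre-created notification queue to confirm events are arriving
queue = QueueClient(
    account_url="https://<source-storage-account>.queue.core.windows.net",
    queue_name="some_pregenerated_queue",
    credential=credential,
)
for msg in queue.peek_messages(max_messages=5):
    print(msg.content)  # Event Grid blob-created payloads (possibly base64-encoded) should show up here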
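For steps 4 and 5, these are the two options most worth adding on top of the dict already shown in the question; the interval value below is only an example:

options["cloudFiles.useNotifications"] = "true"   # make file notification mode explicit
options["cloudFiles.backfillInterval"] = "1 day"  # example interval; periodically falls back to a directory listing to catch files missed by notifications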
