Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader with file notifications on a queue that is on a different storage account to the blobs

ChrisLawford_n1
Contributor

Hello,

I am trying to set up autoloader using file notifications, but as the storage account we are reading from is a premium storage account, we have set up event subscriptions to pump the blob events to queues that exist on a standard Gen2 storage account.

Storage account 1 (blobs):
RG: RG-001
Type: BlockBlobStorage

Storage account 2 (queues):
RG: RG-002
Type: StorageV2 (general purpose v2)

The Databricks volume is linked to storage account 1, and because of this we put the rest of the details for the queue storage account into the autoloader options, thinking that this was the only way to tell autoloader where the queues exist.

At the moment we are configuring it like below:

        df = (spark.readStream.format("cloudFiles")
              .options(**{"cloudFiles.format": "parquet", "cloudFiles.clientId": "XXXXXXX", "cloudFiles.clientSecret": "XXXXXXXXX",
                          "cloudFiles.tenantId": "XXXXXXX", "cloudFiles.subscriptionId": "XXXXXX", "cloudFiles.resourceGroup": "RG-002",
                          "cloudFiles.queueName": "cd",  # existing queue on storage account 2
                          "cloudFiles.includeExistingFiles": "false", "cloudFiles.useNotifications": "true"})
              .load(f"{volume}{data_path_structure}{table}/*"))

This is running, but no data is being processed.
My suspicion is that, however autoloader actually uses those messages, it is not recognizing that the storage account it should be getting the data from is different from the one the queue is on.

Is there something I am missing?

2 REPLIES

lingareddy_Alva
Honored Contributor III

Hi @ChrisLawford_n1 

So in your case, Autoloader is not able to resolve the file paths from the event notifications because they point to a different storage account (Storage Account 1) from the one the queue is associated with.

Use a StorageV2 Account for Both Blob and Queue
- Migrate your blob storage to a StorageV2 (Standard Gen2) account (i.e., move off Premium).
- Then configure native file notification support in the same account: Event Grid → Storage Queue.
- Autoloader will then recognize and handle this properly (a rough sketch follows below).
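
Something like this is what I mean once blobs and queue are in the same StorageV2 account (a rough sketch only; the account, path and service principal values are placeholders, and I'm assuming the service principal has the permissions Autoloader needs to create the notification resources):

        # Rough sketch: blobs and queue in the same StorageV2 account; all values are placeholders.
        df = (spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "parquet")
              .option("cloudFiles.useNotifications", "true")
              .option("cloudFiles.clientId", "<sp-client-id>")
              .option("cloudFiles.clientSecret", "<sp-client-secret>")
              .option("cloudFiles.tenantId", "<tenant-id>")
              .option("cloudFiles.subscriptionId", "<subscription-id>")
              .option("cloudFiles.resourceGroup", "<rg-of-the-storagev2-account>")
              .load("abfss://<container>@<storagev2-account>.dfs.core.windows.net/<path>/"))
        # With no cloudFiles.queueName set, Autoloader creates the Event Grid subscription and queue itself.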

Custom Event Relay
If you must keep blob data in Premium:
- Use Azure Functions or Azure Logic Apps to read events from the Storage Account 2 queue and rewrite or forward them to a queue on Storage Account 1 (or to a compatible service); a rough sketch follows after this list.
- But this is complex and fragile, and generally not recommended unless absolutely required.
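
If you did go the relay route, the shape of it would be roughly this (a sketch only, using the azure-storage-queue SDK; the queue names and connection strings are placeholders, and the destination would have to be a queue on an account that supports queues, or some other compatible service, since the Premium account can't host one):

        # Rough relay sketch: forward Event Grid messages from one storage queue to another.
        from azure.storage.queue import QueueClient

        SOURCE_CONN = "<connection string for storage account 2>"      # placeholder
        DEST_CONN = "<connection string for the destination account>"  # placeholder

        source = QueueClient.from_connection_string(SOURCE_CONN, "<source-queue>")
        dest = QueueClient.from_connection_string(DEST_CONN, "<destination-queue>")

        # Pull a batch of messages, forward the raw payload unchanged, then delete the original.
        for msg in source.receive_messages(messages_per_page=32, visibility_timeout=60):
            dest.send_message(msg.content)
            source.delete_message(msg)

In a real deployment this would more likely be a queue-triggered Azure Function than a polling loop, which is part of why it gets fragile.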

Recommendation
For Autoloader with file notifications to work reliably and correctly, you should:
- Move data ingestion to a StorageV2 (Standard Gen2) account.
- Keep blob and queue in the same storage account.
- Let Databricks handle everything without extra routing logic.

Or, alternatively:

Use Event Hubs or Service Bus instead of Storage Queues:

 - Reconfigure Event Grid: Change your Event Grid subscription endpoint from Storage Queue to Service Bus Topic/Queue or Event Hub
 - Update Autoloader config: Use Service Bus connection string instead of storage account details
 - This bypasses the cross-storage account issue since Service Bus isn't tied to a specific storage account

LR

Hey, 

I agree it would be ideal to have the data on a storage account that supports queues but unfortunately this is not in my control.
Regarding your option:

Use Event Hubs or Service Bus instead of Storage Queues:

 - Reconfigure Event Grid: Change your Event Grid subscription endpoint from Storage Queue to Service Bus Topic/Queue or Event Hub
 - Update Autoloader config: Use Service Bus connection string instead of storage account details
 - This bypasses the cross-storage account issue since Service Bus isn't tied to a specific storage account

Can you please help me understand how to configure autoloader to do this? I can't see any such options in the documentation: 
https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/option...

And regarding this option:

Custom Event Relay
If you must keep blob data in Premium:
- Use Azure Functions or Azure Logic Apps to read events from the Storage Account 2 queue and re-write or forward them to a queue on Storage Account 1 (or to a compatible service).
- But this is complex and fragile, and generally not recommended unless absolutely required.

Given that Storage account 1 is a premium storage account and doesn't support queues, we created an Azure event subscription on the storage account system topic of Storage account 1 and have it output to a queue on Storage account 2. I assume you are mainly talking about using a compatible service such as Event Hub in your other option.

Just for a bit of a rant: 
I would like to know why a queue on a separate storage account, populated with messages from another storage account, couldn't be made to work. The AWS version, where the queueUrl is completely separate from the storage, appears to work exactly as I would expect.
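
For comparison, my understanding of the AWS side is that the existing queue is passed as cloudFiles.queueUrl, a full SQS URL that is completely independent of the bucket being read (rough sketch, placeholder names only):

        # Rough AWS comparison sketch: the SQS queue URL is supplied separately from the S3 path being read.
        df = (spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "parquet")
              .option("cloudFiles.useNotifications", "true")
              .option("cloudFiles.queueUrl", "https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>")
              .load("s3://<bucket>/<path>/"))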
The messages on the queue, being the system topic events, contain all the information needed to locate the correct data:
{
  "topic": "/subscriptions/{SUB_ID}/resourceGroups/{RG_001}/providers/Microsoft.Storage/storageAccounts/{SA-001}",
  "subject": "/blobServices/default/containers/rawzone/blobs/{PATH_TO_PARQUET}",
  "eventType": "Microsoft.Storage.BlobCreated",
  "id": "{ID}",
  "data": {
    "api": "CreateFile",
    "clientRequestId": "{CLIENT_REQUEST_ID}",
    "requestId": "{REQUEST_ID}",
    "eTag": "{ETAG}",
    "contentType": "application/octet-stream",
    "contentLength": 0,
    "contentOffset": 0,
    "blobType": "BlockBlob",
    "blobProperties": [{"acl": [{"access": "u::rw,g::r,o::", "permission": "0640", "owner": "{OWNER_ID}", "group": "$superuser"}]}],
    "blobUrl": "https://{SA-001}.blob.core.windows.net/rawzone/{PATH_TO_PARQUET}",
    "url": "https://{SA-001}.dfs.core.windows.net/rawzone/{PATH_TO_PARQUET}",
    "sequencer": "00000000000000000000000000031001000000000000dd40",
    "identity": "{ID}",
    "storageDiagnostics": {"batchId": "{BATCH_ID}"}
  },
  "dataVersion": "3",
  "metadataVersion": "1",
  "eventTime": "2025-07-29T12:47:45.7969224Z"
}

The event contains the storage account information in the topic and blobUrl fields.
Interestingly, they must be using either the subject field or the blob URLs to get the relative path to the parquet, as no other fields contain it. So it seems like they have the name of the storage account where the data is actually located but are not using it. It would be nice to know how autoloader works under the hood.
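
To illustrate, a few lines of Python are enough to pull both the source storage account and the blob path out of one of these events (a sketch only; message_body here stands in for the raw text of a queue message like the one above):

        import json
        from urllib.parse import urlparse

        # Sketch: extract the source storage account and blob path from a BlobCreated event.
        event = json.loads(message_body)
        account = event["topic"].split("/storageAccounts/")[-1]   # e.g. SA-001
        blob_url = urlparse(event["data"]["blobUrl"])
        container, _, blob_path = blob_url.path.lstrip("/").partition("/")
        print(account, container, blob_path)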
My expectation would be that the autoloader configuration would be happy with the load path pointing at the location you want to read data from, while the additional notification configuration reads from a queue that could live anywhere.
