
Configure multiple source paths for auto loader

MRTN
New Contributor III

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

1 ACCEPTED SOLUTION

Anonymous
Not applicable

@Morten Stakkeland:

Yes, it's possible to configure Auto Loader to read from multiple locations.

Auto Loader (the cloudFiles source) monitors one input path per stream, and that path may contain glob patterns. Glob patterns cannot reach across containers, though, so for two containers in the same storage account the usual pattern is to define one cloudFiles stream per container and union the two streams into a single streaming DataFrame. Since the schemas of your files are identical, the union is straightforward, and the combined stream can be written to one Delta table through a single checkpoint.

If the schemas ever drift apart, the streams can still be combined with unionByName with allowMissingColumns=True, as long as the overlapping columns keep the same names and data types; otherwise, keep two separate streams writing to the same target table, which is essentially what you are doing today.
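A hedged PySpark sketch of the stream-per-container-plus-union pattern (the storage account, container names, and schema-location paths below are placeholders, not values from the original post):

```python
def container_path(account: str, container: str) -> str:
    """ABFSS URI for the root of one container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"

def read_csv_stream(spark, account: str, container: str, schema_location: str):
    """One Auto Loader (cloudFiles) stream; each stream monitors a single path."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", schema_location)
        .load(container_path(account, container))
    )

def combined_stream(spark, account: str):
    # Identical schemas, so the two streams union cleanly by column name.
    s1 = read_csv_stream(spark, account, "container1", "/tmp/_schemas/c1")
    s2 = read_csv_stream(spark, account, "container2", "/tmp/_schemas/c2")
    return s1.unionByName(s2)
```

The combined stream is then written once, e.g. with writeStream and a single checkpointLocation, so both containers feed one Delta table through one checkpoint.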


2 REPLIES


MRTN
New Contributor III

@Suteja Kanuri: Thanks for this useful answer! In the meantime, we have moved on to using file notification mode on Azure. Can the same approach be used to monitor two folders in that case?
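File notification mode is enabled per stream via the cloudFiles.useNotifications option, and each stream still watches a single path, so monitoring two folders again means two streams unioned as above. A minimal sketch of the option set (the file format is a placeholder; on Azure, additional credential options are also required and are omitted here):

```python
def file_notification_options(file_format: str = "csv") -> dict:
    # cloudFiles.useNotifications = "true" switches Auto Loader from directory
    # listing to file notification mode. On Azure, credential options for a
    # service principal or an existing queue are also needed; they are left
    # out of this sketch.
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.useNotifications": "true",
    }

def read_notified_stream(spark, path: str):
    """One notification-mode Auto Loader stream; still one folder per stream."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in file_notification_options().items():
        reader = reader.option(key, value)
    return reader.load(path)
```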
