Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Configure multiple source paths for auto loader

MRTN
New Contributor III

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

1 ACCEPTED SOLUTION


Anonymous
Not applicable

@Morten Stakkeland:

Yes, it's possible to configure an autoloader to read from multiple locations.

You can define multiple CloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since the schemas of the files are identical, you can use the same schema for both sources. Here's an example of how you can define multiple sources in your autoloader configuration:

{
  "format": "delta",
  "mode": "append",
  "cloudFiles": {
    "cloudStorage": {
      "timeout": "1h",
      "accountName": "<storage-account-name>",
      "accountKey": "<storage-account-access-key>"
    },
    "useIncrementalListing": true,
    "maxConcurrentFileCount": 20,
    "source": [
      {
        "path": "/container1/",
        "globPattern": "*.csv",
        "recursive": true
      },
      {
        "path": "/container2/",
        "globPattern": "*.csv",
        "recursive": true
      }
    ]
  }
}

In this example, we define two sources, one for the /container1/ directory and one for the /container2/ directory. The globPattern parameter specifies that we only want to load CSV files, and the recursive parameter tells the autoloader to recursively search for files in subdirectories.

Note that you can also use different schemas for the two sources if necessary, as long as they have the same column names and data types.
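
For reference, here is a minimal PySpark sketch (not from the original post) of one way to combine the two containers into a single Auto Loader query: read each container with its own cloudFiles stream, union the two streams, and write once. It assumes a Databricks notebook where spark is predefined; the schema, container, account, checkpoint, and target names are placeholders.

# Sketch only: two Auto Loader streams over different containers, unioned into one
# query. All names and paths below are placeholders, not values from this thread.
def read_container(path):
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .schema("id STRING, value STRING")  # placeholder schema, identical for both containers
        .load(path)
    )

stream1 = read_container("abfss://container1@<storage-account-name>.dfs.core.windows.net/")
stream2 = read_container("abfss://container2@<storage-account-name>.dfs.core.windows.net/")

(
    stream1.unionByName(stream2)
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "<checkpoint-path>")  # one checkpoint for the combined stream
    .start("<target-delta-path>")
)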


REPLIES


MRTN
New Contributor III

@Suteja Kanuri Thanks for this useful answer! In the meantime, we have moved on to using File Notification mode on Azure. Can we use the same "source" key to monitor two folders in this case?
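
For context, file notification mode is switched on per stream with the cloudFiles.useNotifications option, and each stream provisions notification resources for the path it loads. The sketch below only shows the shape of a single notification-mode reader on Azure; all identifiers are placeholders, and it does not confirm whether a multi-path "source" list is supported in this mode.

# Sketch of one Auto Loader reader in file notification mode on Azure.
# Every identifier below is a placeholder; each stream sets up its own
# notification resources for the path passed to load().
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-client-secret>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .schema("id STRING, value STRING")  # placeholder schema
    .load("abfss://container1@<storage-account-name>.dfs.core.windows.net/")
)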

I can't find this documented anywhere. If this is possible, it could be a game changer for me. I'm racking my brain trying to figure out how to work within the limits of Auto Loader's checkpoints while also breaking my large directories into smaller bites. I know I've got maxBytesPerTrigger, but I'm trying to incrementally load my lake without having to reset the checkpoints. When I try this, the validate function of my DLT doesn't like that the path is not specified via the source list.
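
On the rate-limiting point, cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger are set on the Auto Loader reader, so a large backlog can be worked through in smaller micro-batches against the same checkpoint. A hedged sketch with placeholder values:

# Sketch: soft rate limits on an Auto Loader reader so an existing checkpoint can
# keep being used while a large backlog is processed in smaller micro-batches.
# The limits, schema, and path are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.maxBytesPerTrigger", "10g")   # soft cap on bytes per micro-batch
    .option("cloudFiles.maxFilesPerTrigger", "1000")  # optional cap on files per micro-batch
    .schema("id STRING, value STRING")                # placeholder schema
    .load("abfss://container1@<storage-account-name>.dfs.core.windows.net/")
)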
