Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Configure multiple source paths for auto loader

MRTN
New Contributor III

I am currently using two streams to monitor data in two different containers on an Azure storage account. Is there any way to configure an autoloader to read from two different locations? The schemas of the files are identical.

1 ACCEPTED SOLUTION


Anonymous
Not applicable

@Morten Stakkeland:

Yes, it's possible to configure an autoloader to read from multiple locations.

You can define multiple CloudFiles sources for the autoloader, each pointing to a different container in the same storage account. In your case, since the schemas of the files are identical, you can use the same schema for both sources. Here's an example of how you can define multiple sources in your autoloader configuration:

{
  "format": "delta",
  "mode": "append",
  "cloudFiles": {
    "cloudStorage": {
      "timeout": "1h",
      "accountName": "<storage-account-name>",
      "accountKey": "<storage-account-access-key>"
    },
    "useIncrementalListing": true,
    "maxConcurrentFileCount": 20,
    "source": [
      {
        "path": "/container1/",
        "globPattern": "*.csv",
        "recursive": true
      },
      {
        "path": "/container2/",
        "globPattern": "*.csv",
        "recursive": true
      }
    ]
  }
}

In this example, we define two sources, one for the /container1/ directory and one for the /container2/ directory. The globPattern parameter specifies that we only want to load CSV files, and the recursive parameter tells the autoloader to recursively search for files in subdirectories.

Note that you can also use different schemas for the two sources if necessary, as long as they have the same column names and data types.
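
For reference, here is a minimal PySpark sketch (not from the original post) of one way to combine the two containers into a single Auto Loader query: read each container with its own cloudFiles stream, union the two streams, and write once. It assumes a Databricks notebook where spark is predefined; the schema, container, account, checkpoint, and target names are placeholders.

# Sketch only: two Auto Loader streams over different containers, unioned into one
# query. All names and paths below are placeholders, not values from this thread.
def read_container(path):
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .schema("id STRING, value STRING")  # placeholder schema, identical for both containers
        .load(path)
    )

stream1 = read_container("abfss://container1@<storage-account-name>.dfs.core.windows.net/")
stream2 = read_container("abfss://container2@<storage-account-name>.dfs.core.windows.net/")

(
    stream1.unionByName(stream2)
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "<checkpoint-path>")  # one checkpoint for the combined stream
    .start("<target-delta-path>")
)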


REPLIES


MRTN
New Contributor III

@Suteja Kanuri Thanks for this useful answer! In the meantime, we have moved on to using File Notification mode on Azure. Can we use the same "source" key to monitor two folders in this case?
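
For context, file notification mode is switched on per stream with the cloudFiles.useNotifications option, and each stream provisions notification resources for the path it loads. The sketch below only shows the shape of a single notification-mode reader on Azure; all identifiers are placeholders, and it does not confirm whether a multi-path "source" list is supported in this mode.

# Sketch of one Auto Loader reader in file notification mode on Azure.
# Every identifier below is a placeholder; each stream sets up its own
# notification resources for the path passed to load().
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-client-secret>")
    .option("cloudFiles.resourceGroup", "<resource-group>")
    .schema("id STRING, value STRING")  # placeholder schema
    .load("abfss://container1@<storage-account-name>.dfs.core.windows.net/")
)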

I can't find this documented anywhere. If this is possible, it could be a game changer for me. I'm racking my brain trying to figure out how to work within the limits of Auto Loader's checkpoints while also breaking my large directories into smaller bites. I know I've got maxBytesPerTrigger, but I'm trying to incrementally load my lake without having to reset the checkpoints. When I try this, the validate function of my DLT doesn't like that the path is not specified via the source list.
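
On the rate-limiting point, cloudFiles.maxBytesPerTrigger and cloudFiles.maxFilesPerTrigger are set on the Auto Loader reader, so a large backlog can be worked through in smaller micro-batches against the same checkpoint. A hedged sketch with placeholder values:

# Sketch: soft rate limits on an Auto Loader reader so an existing checkpoint can
# keep being used while a large backlog is processed in smaller micro-batches.
# The limits, schema, and path are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.maxBytesPerTrigger", "10g")   # soft cap on bytes per micro-batch
    .option("cloudFiles.maxFilesPerTrigger", "1000")  # optional cap on files per micro-batch
    .schema("id STRING, value STRING")                # placeholder schema
    .load("abfss://container1@<storage-account-name>.dfs.core.windows.net/")
)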
