MohanaBasak, Databricks Employee

Databricks Auto Loader is a powerful tool for efficiently ingesting files from cloud storage. In this blog, we'll focus on setting up Auto Loader in file notification mode on AWS using service credentials.

Why Use File Notification Mode?

File notification mode is more performant and scalable than directory listing mode, especially for large input directories or high volumes of files. It leverages AWS services like SNS and SQS to detect new files, allowing Auto Loader to scale to ingest millions of files per hour.

Introduction to File Notification Mode with Cloud Service Credentials

Previously, Databricks Auto Loader in file notification mode relied on instance profiles, which limited its use to single-user compute clusters. Customers using Unity Catalog (UC) had to provide additional credentials for setting up and consuming notifications, but there was no secure way to manage these credentials in shared and serverless compute environments. As a result, Directory Listing Mode was the only supported option for Auto Loader in these environments.

Cloud Service Credentials (in Unity Catalog) solve this issue by providing a secure way to manage credential access while enabling file notifications across all compute types — standard (or shared), dedicated (or single user), and serverless. The service credential allows users to securely and efficiently configure Auto Loader to access cloud storage.

To use this feature, your compute must run Databricks Runtime 16.2 or later. As serverless compute environments adopt newer DBR versions, this functionality will also become available in those environments, including in DLT pipelines.

Understanding Key Concepts

Before proceeding, let's clarify a few essential AWS concepts:

  • IAM Role: Grants Databricks Unity Catalog the necessary permissions to access AWS services securely.
  • S3 (Simple Storage Service): Scalable storage service that stores and retrieves data, such as files for Auto Loader ingestion.
  • SNS (Simple Notification Service): Pub/sub messaging service that notifies SQS when a file lands in S3.
  • SQS (Simple Queue Service): A message queuing service that Auto Loader reads to detect new files.

Prerequisites

  1. An AWS account with appropriate permissions
  2. A Databricks workspace configured on AWS
  3. An S3 bucket to store your input data

Now, let's walk through the steps to set up Auto Loader.

(Optional) Step 1: Create an S3 Bucket

To begin, we can either set up a new S3 bucket or reuse an existing one for which we have the necessary permissions. This bucket will store input files for Auto Loader processing.
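
If you prefer to script this step instead of using the AWS console, here is a minimal boto3 sketch; the bucket name and region are placeholder assumptions, and you may want to add versioning, encryption, or lifecycle settings on top of it.

import boto3

# Hypothetical placeholders; replace with your own bucket name and region.
bucket_name = "<s3_bucket_in_step_1>"
region = "<aws_region>"

s3 = boto3.client("s3", region_name=region)

# us-east-1 does not accept a LocationConstraint, so it is handled separately.
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={"LocationConstraint": region},
    )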

Step 2: Create IAM Roles

Next, we need to create IAM roles that Auto Loader will use to access AWS services. This includes access to S3 buckets as well as the use of SNS and SQS. Here, I am creating two IAM roles—one for the storage credential and one for the service credential—but you can combine these as two policies within a single IAM role.

  1. Go to the AWS IAM console.
  2. Create a new role.
  3. Create an inline policy. This policy lets the Unity Catalog storage credential access the S3 bucket. (Optional) You can add KMS permissions to this policy if your bucket uses KMS encryption.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation"
                ],
                "Resource": [
                    "arn:aws:s3:::<s3_bucket_in_step_1>/*",
                    "arn:aws:s3:::<s3_bucket_in_step_1>"
                ],
                "Effect": "Allow"
            },
            {
                "Action": [
                    "sts:AssumeRole"
                ],
                "Resource": [
                    "arn:aws:iam::<aws_account_id>:role/<iam_role_in_step_2>"
                ],
                "Effect": "Allow"
            }
        ]
    }
    
  4. Trust Relationship – Add the policy below; it lets Unity Catalog assume the role.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
                        "arn:aws:iam::<aws_account_id>:role/<iam_role_in_step_2>"
                    ]
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {
                        "sts:ExternalId": [
                            "0000"        
                         ]
                    }
                }
            }
        ]
    }
    
  5. Create one more IAM role for the service credential (or add these as additional policies to the existing role); the same trust relationship as Step 2.4 can be used for it. These policies let Auto Loader create and use SNS and SQS. You can either let Auto Loader create the SNS topic, SQS queue, and S3 event notification automatically (Option 1) or create them manually yourself (Option 2). Use whichever fits your setup, or use Option 1 for the first-time setup and then switch the policy to Option 2.
    1. [Option 1] This policy lets Databricks Auto Loader create an SNS topic and an SQS queue with the prefix databricks-auto-ingest-* during stream initialization of the Auto Loader job.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "DatabricksAutoLoaderSetup",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetBucketNotification",
                      "s3:PutBucketNotification",
                      "sns:ListSubscriptionsByTopic",
                      "sns:GetTopicAttributes",
                      "sns:SetTopicAttributes",
                      "sns:CreateTopic",
                      "sns:TagResource",
                      "sns:Publish",
                      "sns:Subscribe",
                      "sqs:CreateQueue",
                      "sqs:DeleteMessage",
                      "sqs:ReceiveMessage",
                      "sqs:SendMessage",
                      "sqs:GetQueueUrl",
                      "sqs:GetQueueAttributes",
                      "sqs:SetQueueAttributes",
                      "sqs:TagQueue",
                      "sqs:ChangeMessageVisibility"
                  ],
                  "Resource": [
                      "arn:aws:s3:::<s3_bucket_in_step_1>",
                      "arn:aws:sqs:<aws_region>:<aws_account_id>:databricks-auto-ingest-*",
                      "arn:aws:sns:<aws_region>:<aws_account_id>:databricks-auto-ingest-*"
                  ]
              },
              {
                  "Sid": "DatabricksAutoLoaderList",
                  "Effect": "Allow",
                  "Action": [
                      "sqs:ListQueues",
                      "sqs:ListQueueTags",
                      "sns:ListTopics"
                  ],
                  "Resource": "*"
              },
              {
                  "Sid": "DatabricksAutoLoaderTeardown",
                  "Effect": "Allow",
                  "Action": [
                      "sns:Unsubscribe",
                      "sns:DeleteTopic",
                      "sqs:DeleteQueue"
                  ],
                  "Resource": [
                      "arn:aws:sqs:<aws_region>:<aws_account_id>:databricks-auto-ingest-*",
                      "arn:aws:sns:<aws_region>:<aws_account_id>:databricks-auto-ingest-*"
                  ]
              }
          ]
      }
      
    2. [Option 2] This policy lets Databricks Auto Loader use an existing SNS topic and SQS queue. Follow the steps below to create the SNS topic, SQS queue, and bucket event notification; a minimal boto3 sketch of these steps appears after the policy.
      1. Create an SNS topic
      2. Create an SQS queue
      3. Create an S3 bucket event notification
      4. Add the below inline policy to the IAM role
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Sid": "DatabricksAutoLoaderUse",
                    "Effect": "Allow",
                    "Action": [
                        "s3:GetBucketNotification",
                        "sns:ListSubscriptionsByTopic",
                        "sns:GetTopicAttributes",
                        "sns:TagResource",
                        "sns:Publish",
                        "sqs:DeleteMessage",
                        "sqs:ReceiveMessage",
                        "sqs:SendMessage",
                        "sqs:GetQueueUrl",
                        "sqs:GetQueueAttributes",
                        "sqs:TagQueue",
                        "sqs:ChangeMessageVisibility"
                    ],
                    "Resource": [
                        "arn:aws:sqs:<aws_region>:<aws_account_id>:<sqs_for_auto_loader>",
                        "arn:aws:sns:<aws_region>:<aws_account_id>:<sns_for_auto_loader>",
                        "arn:aws:s3:::<s3_bucket_in_step_1>"
                    ]
                }
            ]
        }
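
For illustration, here is a minimal boto3 sketch of these three steps; the topic, queue, bucket, and account values are placeholder assumptions, and the access policies are pared down to the minimum needed for S3 -> SNS -> SQS delivery, so adapt them to your own security requirements.

import json
import boto3

# Hypothetical placeholders; replace with your own values.
bucket = "<s3_bucket_in_step_1>"
region = "<aws_region>"
topic_name = "<sns_for_auto_loader>"
queue_name = "<sqs_for_auto_loader>"

sns = boto3.client("sns", region_name=region)
sqs = boto3.client("sqs", region_name=region)
s3 = boto3.client("s3", region_name=region)

# 1. Create the SNS topic and allow the S3 bucket to publish to it.
topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
sns.set_topic_attributes(
    TopicArn=topic_arn,
    AttributeName="Policy",
    AttributeValue=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "SNS:Publish",
            "Resource": topic_arn,
            "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{bucket}"}},
        }],
    }),
)

# 2. Create the SQS queue and allow the SNS topic to send messages to it.
queue_url = sqs.create_queue(QueueName=queue_name)["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]
sqs.set_queue_attributes(
    QueueUrl=queue_url,
    Attributes={"Policy": json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })},
)

# 3. Subscribe the queue to the topic, then send s3:ObjectCreated:* events from
# the bucket to the topic. Note: this call replaces the bucket's existing
# notification configuration.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "TopicConfigurations": [{
            "TopicArn": topic_arn,
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)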
        

Step 3: Create a Storage Credential in Unity Catalog

  1. In Databricks, navigate to the Catalog Explorer
  2. Go to External Data > Credentials
  3. Click "Create credential" and keep the credential type as "Storage Credential"
  4. Enter a name and the ARN of the IAM role created in Step 2.3 (arn:aws:iam::<aws_account_id>:role/<iam_role_in_step_2>)
  5. Copy the External ID provided
  6. Edit the trust relationship of the storage credential IAM role (created in Step 2.3) and add the External ID in place of 0000
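
If you prefer to create the storage credential programmatically instead of through the UI, here is a minimal sketch using the Databricks SDK for Python; it assumes the databricks-sdk package is installed and authenticated, the credential name is a placeholder, and the exact request class name can differ between SDK versions.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()

# Hypothetical credential name; the role ARN is the storage credential role from Step 2.3.
# Note: in older SDK versions the request class is named AwsIamRole instead of AwsIamRoleRequest.
storage_cred = w.storage_credentials.create(
    name="<name_of_storage_credential>",
    aws_iam_role=catalog.AwsIamRoleRequest(
        role_arn="arn:aws:iam::<aws_account_id>:role/<iam_role_in_step_2>"
    ),
    comment="Storage credential for the Auto Loader input bucket",
)

# The response includes the External ID to paste into the role's trust policy
# (in place of 0000); it is also shown in the Catalog Explorer UI.
print(storage_cred)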

Step 4: Create an External Location in Unity Catalog

  1. Go to Catalog > External Data > External Locations
  2. Click "Create external location" > Manual > Next
  3. Enter a name and the URL of the S3 bucket from Step 1 (s3://<s3_bucket_in_step_1>)
  4. Select the Storage Credential created in Step 3
  5. Click "Create"
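
The external location can likewise be created with the Databricks SDK for Python; this is a minimal sketch with placeholder names.

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Hypothetical names; credential_name must match the storage credential from Step 3.
external_location = w.external_locations.create(
    name="<name_of_external_location>",
    url="s3://<s3_bucket_in_step_1>",
    credential_name="<name_of_storage_credential_in_step_3>",
    comment="External location for the Auto Loader input bucket",
)
print(external_location)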

Step 5: Create a Service Credential in Unity Catalog

  1. Go to Catalog > External Data > Credentials
  2. Click "Create credential" and select "Service Credential"
  3. Enter a name and the ARN of the service credential IAM role created in Step 2.5 (or the combined role, if you added the service credential policy to the role from Step 2.3)
  4. Copy the External ID provided
  5. Edit the trust relationship of the service credential IAM role (created in Step 2.5) and replace 0000 with the appropriate External ID

[Optional] If you created a single role for both the storage and service credentials, then add both External IDs (from Steps 3 and 5) to the trust policy, for example: "sts:ExternalId": ["<external_id_from_step_3>", "<external_id_from_step_5>"].

Step 6: Configure Auto Loader

Now we can configure Auto Loader using PySpark. Run and test the following code from a notebook in your Databricks workspace:

options = {
  "cloudFiles.format": "parquet",  # Adjust based on your file format
  "cloudFiles.schemaLocation": "s3://<s3_bucket_in_step_1>/schema/",
  "cloudFiles.useNotifications": True,
  "databricks.serviceCredential": "<name_of_service_credential_in_step_5>",
  "cloudFiles.region": "<AWS_region_containing_S3_SNS_SQS>",  # [OPTIONAL] Add this if your S3 bucket, SNS, and SQS are in a different region than the Databricks workspace
  "cloudFiles.queueUrl": "<SQS_URL_which_auto_loader_needs_to_read_from>",  # [OPTIONAL] Add this if you manually configured the SQS queue in Option 2 of Step 2.5
}

df = (
  spark.readStream.format("cloudFiles")
  .options(**options)
  .load("s3://<s3_bucket_in_step_1>/input/")
)

(
  df.writeStream.option("checkpointLocation", "s3://<s3_bucket_in_step_1>/checkpoint/")
  .trigger(availableNow=True)
  .start("s3://<s3_bucket_in_step_1>/output/")
)

The above example uses Python, but you can achieve the same with Scala.
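
As a quick sanity check after the stream finishes, you can read the output path back as a batch DataFrame; the snippet below is a minimal sketch using the same placeholder paths.

# Batch read of the files written by the streaming query above. Neither the writer
# nor this reader sets an explicit format, so both use the workspace default
# data source (Parquet unless overridden).
output_df = spark.read.load("s3://<s3_bucket_in_step_1>/output/")
print(output_df.count())
output_df.show(5)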

Best Practices

  1. Monitor your SQS queue to ensure messages are being processed efficiently (see the sketch after this list).
  2. Tear down notification services when they are no longer needed (use the service credential created in Step 5 to create the CloudFilesAWSResourceManager).
  3. Set up AWS fanout events to trigger multiple Auto Loader streams from the same set of S3 files.
  4. If you need more than 100 file notification pipelines for a single S3 bucket, leverage a service such as AWS Lambda to fan out notifications from a single queue that listens to an entire bucket into directory-specific queues.
  5. To avoid conflicting event notifications, use a dedicated S3 bucket for Auto Loader, or use AWS fanout so that a single bucket notification can feed multiple queues.
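
As a simple illustration of the first best practice, here is a minimal boto3 sketch that checks queue depth; the queue URL and region are placeholder assumptions.

import boto3

sqs = boto3.client("sqs", region_name="<aws_region>")

# Placeholder queue URL; use the queue backing your Auto Loader stream. Queues that
# Auto Loader creates automatically have names starting with databricks-auto-ingest-.
attrs = sqs.get_queue_attributes(
    QueueUrl="<SQS_URL_which_auto_loader_reads_from>",
    AttributeNames=[
        "ApproximateNumberOfMessages",
        "ApproximateNumberOfMessagesNotVisible",
    ],
)["Attributes"]

# A steadily growing backlog suggests the stream is not keeping up with new files.
print(attrs)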

Cross-Account Setup for Auto Loader

When setting up Databricks Auto Loader to access an S3 bucket in a different AWS account from the Databricks workspace, the core setup steps remain essentially the same. However, there are a few key considerations to keep in mind:

  • IAM Role Creation: Create the IAM role (Step 2) in the AWS account where the S3 bucket resides rather than in the Databricks workspace’s AWS account. You can follow the same steps mentioned in Step 2 to create the role and set up storage and service credentials.
  • SNS and SQS Location: To enable proper event notifications, ensure that SNS and SQS resources are also created in the same AWS account as the S3 bucket.

With this approach, Databricks can securely connect to the S3 bucket and its SNS/SQS resources in a separate AWS account, leveraging the trust policy and Unity Catalog service credential. This cross-account setup ensures secure and efficient data ingestion while maintaining proper access controls between AWS accounts.

Conclusion

Following these steps, you've set up Databricks Auto Loader in file notification mode on AWS using service credentials. This configuration allows efficient, scalable data ingestion from S3 into your Databricks environment.

For information on setting up Auto Loader in file notification mode on other cloud platforms, refer to the Databricks Auto Loader documentation.

Remember that while the core concepts are similar across cloud providers, the specific services and configuration steps vary. Service credentials can be set up on all clouds, but cross-cloud credentials are not supported.