Databricks Auto Loader is a powerful tool for efficiently ingesting files from cloud storage. In this blog, we'll focus on setting up Auto Loader in file notification mode on AWS using service credentials.
File notification mode is more performant and scalable than directory listing mode, especially for large input directories or high volumes of files. It leverages AWS services like SNS and SQS to detect new files, allowing Auto Loader to scale to ingest millions of files per hour.
Previously, Databricks Auto Loader in file notification mode relied on instance profiles, which limited its use to single-user compute clusters. Customers using Unity Catalog (UC) had to provide additional credentials for setting up and consuming notifications, but there was no secure way to manage these credentials in shared and serverless compute environments. As a result, directory listing mode was the only supported option for Auto Loader in these environments.
Cloud Service Credentials (in Unity Catalog) solve this issue by providing a secure way to manage credential access while enabling file notifications across all compute types — standard (or shared), dedicated (or single user), and serverless. The service credential allows users to securely and efficiently configure Auto Loader to access cloud storage.
To use this feature, your compute must run Databricks Runtime 16.2 or later. As serverless compute environments adopt newer DBR versions, this functionality will also become available in those environments, including in DLT pipelines.
Before proceeding, let's clarify a few essential AWS concepts that Auto Loader relies on: Amazon S3 is where the input files land; Amazon SNS and SQS provide the notification topic and queue that signal new file arrivals; and IAM roles and policies grant Databricks access to these services.
Now, let's walk through the steps to set up Auto Loader.
To begin, we can either set up a new S3 bucket or reuse an existing one for which we have the necessary permissions. This bucket will store input files for Auto Loader processing.
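For example, here is a minimal boto3 sketch, with a hypothetical bucket name and region, that reuses the bucket if it already exists and creates it otherwise:

import boto3

# Hypothetical values; replace with your own bucket name and region.
BUCKET = "my-auto-loader-input-bucket"
REGION = "us-west-2"

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket only if it does not already exist in this account.
existing = {b["Name"] for b in s3.list_buckets()["Buckets"]}
if BUCKET not in existing:
    if REGION == "us-east-1":
        # us-east-1 rejects an explicit LocationConstraint.
        s3.create_bucket(Bucket=BUCKET)
    else:
        s3.create_bucket(
            Bucket=BUCKET,
            CreateBucketConfiguration={"LocationConstraint": REGION},
        )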
Next, we need to create the IAM roles that Auto Loader will use to access AWS services: the S3 bucket itself, plus SNS and SQS for file notifications. Here, I am creating two IAM roles, one for the storage credential and one for the service credential, but you can also combine these as two policies within a single IAM role. First, the permissions policy for the storage credential role, which grants access to the S3 bucket and allows the role to assume itself:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket_in_step_1>/*",
        "arn:aws:s3:::<s3_bucket_in_step_1>"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": [
        "arn:aws:iam::<aws_account_id>:role/<iam_role_in_step_2>"
      ],
      "Effect": "Allow"
    }
  ]
}
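Both roles need a trust policy that allows the Databricks Unity Catalog master role to assume them. The 0000 external ID below is a placeholder; replace it with the external ID generated when you create the corresponding credential in Unity Catalog: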
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL",
          "arn:aws:iam::<aws_account_id>:role/<iam_role_in_step_2>"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": [
            "0000"
          ]
        }
      }
    }
  ]
}
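Next, the permissions policy for the service credential role, which lets Auto Loader create, list, and tear down the SNS topics and SQS queues it uses for file notifications: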
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabricksAutoLoaderSetup",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketNotification",
        "s3:PutBucketNotification",
        "sns:ListSubscriptionsByTopic",
        "sns:GetTopicAttributes",
        "sns:SetTopicAttributes",
        "sns:CreateTopic",
        "sns:TagResource",
        "sns:Publish",
        "sns:Subscribe",
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:SetQueueAttributes",
        "sqs:TagQueue",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": [
        "arn:aws:s3:::<s3_bucket_in_step_1>",
        "arn:aws:sqs:<aws_region>:<aws_account_id>:databricks-auto-ingest-*",
        "arn:aws:sns:<aws_region>:<aws_account_id>:databricks-auto-ingest-*"
      ]
    },
    {
      "Sid": "DatabricksAutoLoaderList",
      "Effect": "Allow",
      "Action": [
        "sqs:ListQueues",
        "sqs:ListQueueTags",
        "sns:ListTopics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DatabricksAutoLoaderTeardown",
      "Effect": "Allow",
      "Action": [
        "sns:Unsubscribe",
        "sns:DeleteTopic",
        "sqs:DeleteQueue"
      ],
      "Resource": [
        "arn:aws:sqs:<aws_region>:<aws_account_id>:databricks-auto-ingest-*",
        "arn:aws:sns:<aws_region>:<aws_account_id>:databricks-auto-ingest-*"
      ]
    }
  ]
}
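Alternatively, if you create the SNS topic and SQS queue yourself (and point Auto Loader at them via the optional queue URL setting shown later), you can attach this narrower policy that only allows Auto Loader to use the existing resources: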
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabricksAutoLoaderUse",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketNotification",
        "sns:ListSubscriptionsByTopic",
        "sns:GetTopicAttributes",
        "sns:TagResource",
        "sns:Publish",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:TagQueue",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": [
        "arn:aws:sqs:<aws_region>:<aws_account_id>:<sqs_for_auto_loader>",
        "arn:aws:sns:<aws_region>:<aws_account_id>:<sns_for_auto_loader>",
        "arn:aws:s3:::<s3_bucket_in_step_1>"
      ]
    }
  ]
}
[Optional] If you created a single role for both the storage and service credentials, then add both External IDs (from Steps 3 and 5) to the trust policy as: "sts:ExternalId": ["0000", "1111"].
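As a reference, here is a minimal boto3 sketch, using hypothetical role and file names, of how the trust policy and the permission policies above could be attached to the two IAM roles:

import json
import boto3

iam = boto3.client("iam")

def load_policy(path: str) -> str:
    # The JSON documents shown above, saved locally as files.
    with open(path) as f:
        return f.read()

trust_policy = load_policy("trust_policy.json")

# Role backing the Unity Catalog storage credential.
# Note: because the trust policy also lists the role itself as a principal,
# AWS may require creating the role with a temporary trust policy first and
# then calling iam.update_assume_role_policy once the role exists.
iam.create_role(
    RoleName="auto-loader-storage-credential-role",
    AssumeRolePolicyDocument=trust_policy,
)
iam.put_role_policy(
    RoleName="auto-loader-storage-credential-role",
    PolicyName="storage-credential-policy",
    PolicyDocument=load_policy("storage_credential_policy.json"),
)

# Role backing the Unity Catalog service credential.
iam.create_role(
    RoleName="auto-loader-service-credential-role",
    AssumeRolePolicyDocument=trust_policy,
)
iam.put_role_policy(
    RoleName="auto-loader-service-credential-role",
    PolicyName="service-credential-setup-policy",
    PolicyDocument=load_policy("service_credential_setup_policy.json"),
)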
Now we can set up an Auto Loader stream using PySpark. Run and test the following code from a notebook in your Databricks workspace:
options = {
    "cloudFiles.format": "parquet",  # Adjust based on your file format
    "cloudFiles.schemaLocation": "s3://<s3_bucket_in_step_1>/schema/",
    "cloudFiles.useNotifications": True,
    "databricks.serviceCredential": "<name_of_service_credential_in_step_5>",
    # [OPTIONAL] Add this if your S3 bucket, SNS, and SQS are in a different region than the Databricks workspace
    "cloudFiles.region": "<AWS_region_containing_S3_SNS_SQS>",
    # [OPTIONAL] Add this if you have manually configured the SQS queue in Step 2.5.b
    "cloudFiles.queueUrl": "<SQS_URL_which_auto_loader_needs_to_read_from>",
}

df = (
    spark.readStream.format("cloudFiles")
    .options(**options)
    .load("s3://<s3_bucket_in_step_1>/input/")
)

(
    df.writeStream.option("checkpointLocation", "s3://<s3_bucket_in_step_1>/checkpoint/")
    .trigger(availableNow=True)
    .start("s3://<s3_bucket_in_step_1>/output/")
)
The above example uses Python, but you can achieve the same with Scala.
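To sanity-check the notification resources once the stream has run, you can inspect the SQS queue that Auto Loader created. Here is a minimal boto3 sketch, assuming the default databricks-auto-ingest- queue name prefix used in the policies above and a hypothetical region:

import boto3

# Assumes the same region as the S3 bucket and the default queue prefix
# Auto Loader uses when it provisions notification resources automatically.
sqs = boto3.client("sqs", region_name="us-west-2")

response = sqs.list_queues(QueueNamePrefix="databricks-auto-ingest-")
for queue_url in response.get("QueueUrls", []):
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    print(queue_url, attrs["Attributes"]["ApproximateNumberOfMessages"])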
When setting up Databricks Auto Loader to access an S3 bucket in a different AWS account from the Databricks workspace, the core setup steps remain essentially the same. The main consideration is that the IAM roles, their trust policy, and the SNS/SQS resources are created in the AWS account that owns the bucket, and the Unity Catalog storage and service credentials in your workspace reference those cross-account roles.
With this approach, Databricks can securely connect to the S3 bucket and its SNS/SQS resources in a separate AWS account, leveraging the trust policy and Unity Catalog service credential. This cross-account setup ensures secure and efficient data ingestion while maintaining proper access controls between AWS accounts.
Following these steps, you've set up Databricks Auto Loader in file notification mode on AWS using service credentials. This configuration allows efficient, scalable data ingestion from S3 into your Databricks environment.
For information on setting up Auto Loader on other cloud platforms, refer to the Databricks Auto Loader documentation for Azure and Google Cloud.
Remember that while the core concepts are similar across cloud providers, the specific services and configuration steps vary. Service credentials can be set up on all clouds, but cross-cloud credentials are not supported.