This is a step-by-step guide to setting up a cross-account Databricks Autoloader connection on AWS in file notification mode. It allows you to automatically load data from an S3 bucket in one AWS account (Account A) into a Databricks workspace in another AWS account (Account B).
Databricks Autoloader can either automatically set up the SNS topic and SQS queue for you, or you can create these resources manually and point the Autoloader at them. In either case, you will need an instance profile in Account B that can access the SNS and SQS resources in Account A.
Before proceeding, let's clarify the purpose of the relevant services in AWS.
SNS (Simple Notification Service)
SNS is a fully managed pub/sub messaging service that enables seamless message delivery from publishers to subscribers. Autoloader file notification mode uses an SNS topic to receive a notification whenever a file lands in S3.
SQS (Simple Queue Service)
SQS is a fully managed message queuing service that offers scalable, reliable, and distributed message queues. Autoloader uses an SQS queue to durably store the messages delivered by SNS. When the Autoloader stream starts, it processes the messages from the queue to identify new files.
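To make this concrete, here is a small, optional boto3 sketch (not part of the setup) that peeks at the notifications sitting in such a queue. The queue URL is a placeholder for the queue used by Autoloader, and the parsing assumes the default S3-to-SNS-to-SQS message format:

import json
import boto3

# Placeholder queue URL; substitute the queue used by Autoloader.
queue_url = "https://sqs.<aws_region>.amazonaws.com/<account_a_id>/acc_a_autol_sqs"

sqs = boto3.client("sqs", region_name="<aws_region>")

# Pull a few messages for inspection. They are not deleted here;
# they become visible again after the visibility timeout.
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=5, WaitTimeSeconds=5)

for msg in resp.get("Messages", []):
    envelope = json.loads(msg["Body"])          # SNS envelope
    s3_event = json.loads(envelope["Message"])  # S3 event notification
    for record in s3_event.get("Records", []):
        print(record["s3"]["bucket"]["name"], record["s3"]["object"]["key"])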
Role-Based Access to Buckets
To access S3 buckets in another AWS account, you need to define IAM roles with policies that grant the necessary permissions. These roles establish trust relationships and ensure secure access to resources across accounts.
Trust Relationship
A trust relationship in AWS IAM defines which entities are trusted to assume a particular IAM role. When setting up cross-account access, trust relationships determine which accounts or entities can assume roles in other accounts.
Bucket Policy
A bucket policy in AWS S3 sets permissions for objects within a bucket, controlling access at the bucket and object level. It's written in JSON format and specifies who can access the bucket and what actions they can perform.
Now that we understand the relevant AWS services, we can start setting up the cross-account Autoloader connection.
First, attach an S3 access policy to the instance profile IAM role in Account B (acc_b_instance_profile) so it can list the bucket acc-a-autol-input in Account A and read, write, and delete its objects:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::acc-a-autol-input"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::acc-a-autol-input/*"
      ]
    }
  ]
}
Next, add a bucket policy on acc-a-autol-input in Account A that grants the Account B instance profile role access to the bucket and its objects:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_b_id>:role/acc_b_instance_profile"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::acc-a-autol-input"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_b_id>:role/acc_b_instance_profile"
      },
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": "arn:aws:s3:::acc-a-autol-input/*"
    }
  ]
}
For Autoloader to automatically create the SNS topic and SQS queue, create an IAM role in Account A (acc_a_autol_auto_create_role) with permissions to create and manage SNS and SQS resources:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabricksAutoLoaderSetup",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketNotification",
        "s3:PutBucketNotification",
        "sns:ListSubscriptionsByTopic",
        "sns:GetTopicAttributes",
        "sns:SetTopicAttributes",
        "sns:CreateTopic",
        "sns:TagResource",
        "sns:Publish",
        "sns:Subscribe",
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:SetQueueAttributes",
        "sqs:TagQueue",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": [
        "arn:aws:s3:::acc-a-autol-input",
        "arn:aws:sqs:<aws_region>:<account_a_id>:databricks-auto-ingest-*",
        "arn:aws:sns:<aws_region>:<account_a_id>:databricks-auto-ingest-*"
      ]
    },
    {
      "Sid": "DatabricksAutoLoaderList",
      "Effect": "Allow",
      "Action": [
        "sqs:ListQueues",
        "sqs:ListQueueTags",
        "sns:ListTopics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DatabricksAutoLoaderTeardown",
      "Effect": "Allow",
      "Action": [
        "sns:Unsubscribe",
        "sns:DeleteTopic",
        "sqs:DeleteQueue"
      ],
      "Resource": [
        "arn:aws:sqs:<aws_region>:<account_a_id>:databricks-auto-ingest-*",
        "arn:aws:sns:<aws_region>:<account_a_id>:databricks-auto-ingest-*"
      ]
    }
  ]
}
Attach a trust relationship to this role so that the Account B instance profile is allowed to assume it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_b_id>:role/acc_b_instance_profile"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Then add this statement to the existing policy on the instance profile IAM role acc_b_instance_profile so it can assume the role in Account A:
{
  "Sid": "AssumeRoleAccA",
  "Effect": "Allow",
  "Action": "sts:AssumeRole",
  "Resource": "arn:aws:iam::<account_a_id>:role/acc_a_autol_auto_create_role"
}
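Before starting the stream, you can optionally verify the trust relationship from a notebook. This is a minimal boto3 sketch, not part of the setup; the session name is arbitrary, and the call simply confirms that the cluster's instance profile can assume the Account A role:

import boto3

# Confirm that the instance profile attached to the cluster can assume
# the cross-account role defined above.
sts = boto3.client("sts")
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::<account_a_id>:role/acc_a_autol_auto_create_role",
    RoleSessionName="autoloader-cross-account-check",  # arbitrary session name
)
print(assumed["AssumedRoleUser"]["Arn"])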
Run and test the Autoloader code from a notebook in the Databricks workspace:
options = {
    # File notification mode: new files are discovered via SNS/SQS
    # instead of directory listing.
    "cloudFiles.format": "parquet",
    "cloudFiles.schemaLocation": "s3://acc-a-autol-input/schema/auto/",
    "cloudFiles.useNotifications": True,
    # Cross-account role in Account A that Autoloader assumes to create
    # and use the SNS topic and SQS queue.
    "cloudFiles.roleArn": "arn:aws:iam::<account_a_id>:role/acc_a_autol_auto_create_role",
}

df = (
    spark.readStream.format("cloudFiles")
    .options(**options)
    .load("s3://acc-a-autol-input/data/auto/")
)

(
    df.writeStream.option("checkpointLocation", "s3://acc-a-autol-input/checkpoint/auto1/1/")
    .start("s3://acc-a-autol-input/output/auto/")
)
This Autoloader code will auto-create an SNS topic and an SQS queue with names starting with the databricks-auto-ingest- prefix referenced in the IAM policy above. These resources are created once, during stream initialization of the Autoloader.
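To confirm that the resources exist, you can optionally list them from a notebook. This is a minimal boto3 sketch (not part of the setup) that assumes the Account A role defined above, which already carries the sns:ListTopics and sqs:ListQueues permissions:

import boto3

region = "<aws_region>"

# Assume the Account A role so the list calls run against Account A.
creds = boto3.client("sts").assume_role(
    RoleArn="arn:aws:iam::<account_a_id>:role/acc_a_autol_auto_create_role",
    RoleSessionName="autoloader-resource-check",
)["Credentials"]

session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name=region,
)

# Topics and queues created by Autoloader share the databricks-auto-ingest- prefix.
topics = session.client("sns").list_topics()["Topics"]
print([t["TopicArn"] for t in topics if "databricks-auto-ingest-" in t["TopicArn"]])
print(session.client("sqs").list_queues(QueueNamePrefix="databricks-auto-ingest-").get("QueueUrls", []))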
If you want to manually create the SNS topic and SQS queue and link them to the Autoloader, follow these steps:
First, create an SNS topic acc_a_autol_sns in Account A, configure the bucket acc-a-autol-input to publish its object-created event notifications to it, and attach an access policy that allows the bucket to publish to the topic:
{
  "Version": "2008-10-17",
  "Id": "notificationPolicy",
  "Statement": [
    {
      "Sid": "allowS3Notification",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "SNS:Publish",
      "Resource": "arn:aws:sns:<aws_region>:<account_a_id>:acc_a_autol_sns",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "arn:aws:s3:*:*:acc-a-autol-input"
        }
      }
    }
  ]
}
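If you prefer scripting over the AWS console, the topic creation, policy attachment, and S3 event notification configuration might look like the following boto3 sketch. It is illustrative only; the console or infrastructure-as-code tools work just as well:

import json
import boto3

region = "<aws_region>"
bucket = "acc-a-autol-input"
sns = boto3.client("sns", region_name=region)
s3 = boto3.client("s3", region_name=region)

# Create the topic and attach the access policy shown above.
topic_arn = sns.create_topic(Name="acc_a_autol_sns")["TopicArn"]

topic_policy = {
    "Version": "2008-10-17",
    "Id": "notificationPolicy",
    "Statement": [
        {
            "Sid": "allowS3Notification",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "SNS:Publish",
            "Resource": topic_arn,
            "Condition": {"ArnLike": {"aws:SourceArn": f"arn:aws:s3:*:*:{bucket}"}},
        }
    ],
}
sns.set_topic_attributes(
    TopicArn=topic_arn, AttributeName="Policy", AttributeValue=json.dumps(topic_policy)
)

# Route the bucket's ObjectCreated events to the topic.
# Note: this call replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "TopicConfigurations": [
            {"TopicArn": topic_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)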
Next, create an SQS queue acc_a_autol_sqs in Account A, subscribe it to the SNS topic, and attach an access policy that allows the topic to send messages to the queue:
{
  "Version": "2008-10-17",
  "Id": "notificationPolicy",
  "Statement": [
    {
      "Sid": "allowS3Notification",
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "SQS:SendMessage",
      "Resource": "arn:aws:sqs:<aws_region>:<account_a_id>:acc_a_autol_sqs",
      "Condition": {
        "ArnLike": {
          "aws:SourceArn": "arn:aws:sns:<aws_region>:<account_a_id>:acc_a_autol_sns"
        }
      }
    }
  ]
}
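The queue creation, policy attachment, and SNS subscription can likewise be scripted; a minimal boto3 sketch (again illustrative, not required) might look like this:

import json
import boto3

region = "<aws_region>"
sns = boto3.client("sns", region_name=region)
sqs = boto3.client("sqs", region_name=region)

topic_arn = "arn:aws:sns:<aws_region>:<account_a_id>:acc_a_autol_sns"

# Create the queue and look up its ARN.
queue_url = sqs.create_queue(QueueName="acc_a_autol_sqs")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Attach the access policy shown above so the topic can deliver to the queue.
queue_policy = {
    "Version": "2008-10-17",
    "Id": "notificationPolicy",
    "Statement": [
        {
            "Sid": "allowS3Notification",
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "SQS:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnLike": {"aws:SourceArn": topic_arn}},
        }
    ],
}
sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": json.dumps(queue_policy)})

# Subscribe the queue to the topic so S3 notifications flow into it.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)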
Then create an IAM role in Account A (acc_a_autol_manual_create_role) with permissions to use the existing SNS topic and SQS queue:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabricksAutoLoaderUse",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketNotification",
        "sns:ListSubscriptionsByTopic",
        "sns:GetTopicAttributes",
        "sns:TagResource",
        "sns:Publish",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:TagQueue",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": [
        "arn:aws:sqs:<aws_region>:<account_a_id>:acc_a_autol_sqs",
        "arn:aws:sns:<aws_region>:<account_a_id>:acc_a_autol_sns",
        "arn:aws:s3:::acc-a-autol-input"
      ]
    }
  ]
}
Attach a trust relationship to this role so that the Account B instance profile can assume it:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_b_id>:role/acc_b_instance_profile"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Add this statement to the existing policy on the instance profile IAM role acc_b_instance_profile so it can assume the manually configured role in Account A:
{
  "Sid": "AssumeManualRoleAccA",
  "Effect": "Allow",
  "Action": "sts:AssumeRole",
  "Resource": "arn:aws:iam::<account_a_id>:role/acc_a_autol_manual_create_role"
}
Run and test the Autoloader code from a notebook in the Databricks workspace:
options = {
    "cloudFiles.format": "parquet",
    "cloudFiles.schemaLocation": "s3://acc-a-autol-input/schema/manual/",
    "cloudFiles.useNotifications": True,
    # Cross-account role in Account A with access to the manually created resources.
    "cloudFiles.roleArn": "arn:aws:iam::<account_a_id>:role/acc_a_autol_manual_create_role",
    # URL of the manually created queue that Autoloader should read notifications from.
    "cloudFiles.queueUrl": "https://sqs.<aws_region>.amazonaws.com/<account_a_id>/acc_a_autol_sqs",
}

df = (
    spark.readStream.format("cloudFiles")
    .options(**options)
    .load("s3://acc-a-autol-input/data/manual/")
)

(
    df.writeStream.option("checkpointLocation", "s3://acc-a-autol-input/checkpoint/manual1/1/")
    .start("s3://acc-a-autol-input/output/manual/")
)
By following these steps, you should be able to successfully set up a cross-account Autoloader connection with Databricks.