I'm encountering issues setting up Databricks Auto Loader in file notification mode. The error appears to be related to Unity Catalog (UC) access to the S3 bucket. I have tried running it on a single-node dedicated cluster, but no luck.
Any guidance on resolving this issue would be greatly appreciated.
Documents referenced for setup:
Below is the error message received:
Py4JJavaError: An error occurred while calling o415.load.
: java.nio.file.AccessDeniedException: s3://bucket_name/folder: getFileStatus on s3://bucket_name/folder: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://bucket_name.s3.eu-central-1.amazonaws.com folder {} Hadoop 3.3.6, aws-sdk-java/1.12.390 Linux/5.15.0-1063-aws OpenJDK_64-Bit_Server_VM/25.392-b08 java/1.8.0_392 scala/2.12.15 kotlin/1.6.0 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectMetadataRequest; Request ID: 7FH7VQPTTBFCER18, Extended Request ID: sRHyEyURC221EulMHsMHTxZzK0R1TabG9vPgPV2vl1GsWSSoYwuJxriQYTZxxTMgvJKmlFM/D4KH7x9SZU6pMGDU9Wojk+rYqX+MnajfxEQ=, Cloud Provider: AWS, Instance ID: i-0ff777fafb0f546c9 credentials-provider: com.amazonaws.auth.BasicSessionCredentials credential-header: AWS4-HMAC-SHA256 Credential=ASIA5X45VTLXYJ24XYPS/20240718/eu-central-1/s3/aws4_request signature-present: true (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 7FH7VQPTTBFCER18; S3 Extended Request ID: sRHyEyURC221EulMHsMHTxZzK0R1TabG9vPgPV2vl1GsWSSoYwuJxriQYTZxxTMgvJKmlFM/D4KH7x9SZU6pMGDU9Wojk+rYqX+MnajfxEQ=; Proxy: null), S3 Extended Request ID: sRHyEyURC221EulMHsMHTxZzK0R1TabG9vPgPV2vl1GsWSSoYwuJxriQYTZxxTMgvJKmlFM/D4KH7x9SZU6pMGDU9Wojk+rYqX+MnajfxEQ=:403 Forbidden
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:292)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:197)
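The important parts of that trace are buried in a long line: the failure is a 403 on getFileStatus (a HEAD / GetObjectMetadata call on the load path), i.e. the request is rejected before any notification processing starts. A small sketch that pulls those fields out of the exception text (the excerpt below is trimmed from the trace above; the regex is my own, not from any Databricks API):

```python
import re

# Trimmed excerpt of the Py4JJavaError message pasted above
error_text = (
    "java.nio.file.AccessDeniedException: s3://bucket_name/folder: "
    "getFileStatus on s3://bucket_name/folder: "
    "com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; "
    "request: HEAD https://bucket_name.s3.eu-central-1.amazonaws.com folder "
    "(Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden)"
)

def summarize_s3_error(text):
    """Extract the failing S3 operation, object path, and HTTP status."""
    op = re.search(r"(\w+) on (s3://\S+):", text)
    status = re.search(r"Status Code: (\d+)", text)
    return {
        "operation": op.group(1) if op else None,
        "path": op.group(2) if op else None,
        "status": int(status.group(1)) if status else None,
    }

print(summarize_s3_error(error_text))
```

So the stream never gets as far as the SQS queue; the initial metadata read on the path itself is what is being denied.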
Auto Loader script:
try:
    (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schema_path)
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.region", "eu-central-1")
        .option(
            "cloudFiles.queueUrl",
            "https://sqs.eu-central-1.amazonaws.com/XXXXXX/databricks-auto-ingest-test",
        )
        .load(f"s3://{bucket_name}/{bucket_prefix}")
        .writeStream.option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .toTable(f"{catalog_name}.{schema_name}.{delta_table_name}")
    )
except Exception as e:
    raise e
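One debugging step I would try (my own suggestion, not from the referenced documents) is to make it easy to toggle file-notification mode off. With `cloudFiles.useNotifications` set to `"false"`, Auto Loader falls back to directory listing, which needs only plain read/list access to the path, so a 403 in that mode confirms the problem is the S3 read itself rather than the SQS/SNS setup. A sketch that factors the options into a dict for that purpose:

```python
def cloudfiles_options(use_notifications, region="eu-central-1", queue_url=None):
    """Build Auto Loader options; toggling use_notifications switches between
    file-notification mode and directory-listing mode."""
    opts = {
        "cloudFiles.format": "json",
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }
    if use_notifications:
        # Notification mode needs the region and, in this setup, the
        # pre-created SQS queue URL from the post above
        opts["cloudFiles.region"] = region
        if queue_url is not None:
            opts["cloudFiles.queueUrl"] = queue_url
    return opts

# Usage (on a cluster):
# spark.readStream.format("cloudFiles").options(**cloudfiles_options(False)).load(path)
```

`DataStreamReader.options(**opts)` accepts the dict directly, so the rest of the pipeline stays unchanged.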
IAM Policy attached to IAM Role / Instance profile:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabricksAutoLoaderSetup",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketNotification",
        "s3:PutBucketNotification",
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetLifecycleConfiguration",
        "s3:PutLifecycleConfiguration",
        "sns:ListSubscriptionsByTopic",
        "sns:GetTopicAttributes",
        "sns:SetTopicAttributes",
        "sns:CreateTopic",
        "sns:TagResource",
        "sns:Publish",
        "sns:Subscribe",
        "sqs:CreateQueue",
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueUrl",
        "sqs:GetQueueAttributes",
        "sqs:SetQueueAttributes",
        "sqs:TagQueue",
        "sqs:ChangeMessageVisibility"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_name",
        "arn:aws:sqs:eu-central-1:XXXXX:databricks-auto-ingest-*",
        "arn:aws:sns:eu-central-1:XXXXX:databricks-auto-ingest-*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::bucket_name/*"
      ]
    },
    {
      "Sid": "DatabricksAutoLoaderList",
      "Effect": "Allow",
      "Action": [
        "sqs:ListQueues",
        "sqs:ListQueueTags",
        "sns:ListTopics"
      ],
      "Resource": "*"
    },
    {
      "Sid": "DatabricksAutoLoaderTeardown",
      "Effect": "Allow",
      "Action": [
        "sns:Unsubscribe",
        "sns:DeleteTopic",
        "sqs:DeleteQueue"
      ],
      "Resource": [
        "arn:aws:sqs:eu-central-1:XXXX:databricks-auto-ingest-*",
        "arn:aws:sns:eu-central-1:XXXX:databricks-auto-ingest-*"
      ]
    }
  ]
}
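As a sanity check on the policy itself, here is a small sketch that walks the Allow statements and confirms the object-level and bucket-level S3 actions the failing HEAD call would need are present (the trimmed policy dict below keeps only the two S3 statements from the full policy above, placeholders included; the helper function is mine):

```python
# Trimmed copy of the two S3 statements from the policy above
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetBucketNotification", "s3:PutBucketNotification",
                       "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": ["arn:aws:s3:::bucket_name"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject",
                       "s3:DeleteObject", "s3:PutObjectAcl"],
            "Resource": ["arn:aws:s3:::bucket_name/*"],
        },
    ],
}

def allowed_actions(policy, resource_prefix):
    """Collect Allow-ed actions whose resource list covers the given ARN prefix."""
    actions = set()
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        resources = stmt.get("Resource", [])
        if isinstance(resources, str):
            resources = [resources]
        if any(r.startswith(resource_prefix) for r in resources):
            actions.update(stmt["Action"])
    return actions

# The failing HEAD is effectively s3:GetObject on bucket_name/*
assert "s3:GetObject" in allowed_actions(policy, "arn:aws:s3:::bucket_name/")
```

Since the pasted policy does grant `s3:GetObject` and `s3:ListBucket`, the 403 may not be a missing action here but a question of which principal actually signs the request on a UC-enabled cluster, i.e. whether this role is also the one behind the UC storage credential / external location for this path.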