Hi @gabriel_lazo, configuring Databricks to connect to an AWS S3 access point through a VPC while ensuring other Databricks workspaces cannot access it requires some careful setup.
Let’s break it down:
Instance Profiles for S3 Access:
- Recommended Approach: Use instance profiles to control data access to S3. You load an IAM role into Databricks as an instance profile and attach it to clusters, so data access is governed by the role's IAM policy rather than by per-user AWS keys (a registration sketch follows this list).
- The AWS user who creates the IAM role must have permissions to create or update IAM roles, IAM policies, S3 buckets, and cross-account trust relationships.
- The Databricks user who adds the IAM role as an instance profile in Databricks must be a workspace admin.
- Once added, you can grant users, groups, or service principals permissions to launch clusters with the instance profile.
- Protect access to the instance profile using both cluster access control and notebook access control.
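For the registration step, here is a minimal sketch using the Databricks Instance Profiles REST API (the `/api/2.0/instance-profiles/add` endpoint); the workspace URL, admin token, and ARN are placeholders you would substitute, and the call must be made by a workspace admin as noted above.

```python
import requests

# Placeholder values: substitute your workspace URL, an admin's token,
# and the ARN of the instance profile created for the IAM role.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
ADMIN_TOKEN = "<personal-access-token-of-a-workspace-admin>"
INSTANCE_PROFILE_ARN = "arn:aws:iam::<account-id>:instance-profile/<profile-name>"

# Register the instance profile so it can be attached to clusters.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-profiles/add",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={"instance_profile_arn": INSTANCE_PROFILE_ARN},
)
resp.raise_for_status()
print("Instance profile registered:", INSTANCE_PROFILE_ARN)
```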
Access S3 with URIs and AWS Keys:
- Set Spark properties to configure AWS keys for S3 access.
- Databricks recommends using secret scopes to store credentials securely.
- Create a secret scope and grant users access to read it (a sketch of this step follows this list).
- Set Spark properties in a cluster’s Spark configuration using the following snippet:
  AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
  AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}
- Read from S3 in a notebook using commands like:
  aws_bucket_name = "my-s3-bucket"
  df = spark.read.load(f"s3a://{aws_bucket_name}/flowers/delta/")
  display(df)
  dbutils.fs.ls(f"s3a://{aws_bucket_name}/")
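For the secret-scope step, here is a hedged sketch using the Databricks Secrets REST API to create the scope and store the two keys referenced in the Spark configuration snippet above; the workspace URL, token, group name, and key values are placeholders.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Create a Databricks-backed secret scope named "scope"
# (the name referenced as {{secrets/scope/...}} above).
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/secrets/scopes/create",
    headers=HEADERS,
    json={"scope": "scope"},
).raise_for_status()

# Store the AWS keys as secrets inside the scope.
for key, value in {
    "aws_access_key_id": "<AWS_ACCESS_KEY_ID>",
    "aws_secret_access_key": "<AWS_SECRET_ACCESS_KEY>",
}.items():
    requests.post(
        f"{DATABRICKS_HOST}/api/2.0/secrets/put",
        headers=HEADERS,
        json={"scope": "scope", "key": key, "string_value": value},
    ).raise_for_status()

# Grant a group (placeholder name) READ permission so its members can use the secrets.
requests.post(
    f"{DATABRICKS_HOST}/api/2.0/secrets/acls/put",
    headers=HEADERS,
    json={"scope": "scope", "principal": "data-engineers", "permission": "READ"},
).raise_for_status()
```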
Open-Source Hadoop Options:
- Databricks Runtime also supports the open-source Hadoop S3A configuration options, so you can set fs.s3a.* properties (such as fs.s3a.access.key and fs.s3a.secret.key) globally or per bucket instead of relying on environment variables; a notebook-level sketch follows.
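As one illustration, here is a notebook-level sketch (relying on the notebook-provided `spark` session and `dbutils.secrets`) that reads the keys from the secret scope and sets the open-source S3A properties on the Hadoop configuration at runtime; the scope, key, and bucket names are the placeholders used above.

```python
# Runs in a Databricks notebook, where `spark` and `dbutils` are predefined.
# Scope/key names ("scope", "aws_access_key_id", "aws_secret_access_key")
# match the placeholders used earlier; substitute your own.
access_key = dbutils.secrets.get(scope="scope", key="aws_access_key_id")
secret_key = dbutils.secrets.get(scope="scope", key="aws_secret_access_key")

# Open-source Hadoop S3A options, set on this cluster's Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)

df = spark.read.load("s3a://my-s3-bucket/flowers/delta/")
display(df)
```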
VPC Endpoints and S3 Access Points:
- To keep other Databricks workspaces out, create the access point with a VPC configuration that restricts it to the VPC your workspace's clusters run in; S3 rejects requests to a VPC-restricted access point that do not originate from that VPC.
- Route the clusters' S3 traffic through a gateway VPC endpoint for S3 in that VPC, and grant the instance profile's IAM role access in the access point (and bucket) policy; a boto3 sketch of creating the access point follows.
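On the AWS side, a minimal boto3 sketch of creating a VPC-restricted access point could look like this; the region, account ID, access point name, bucket, and VPC ID are placeholders, and the VPC should be the one your Databricks clusters run in.

```python
import boto3

s3control = boto3.client("s3control", region_name="us-east-1")  # placeholder region

# A VPC-type access point only accepts requests that originate from the
# specified VPC, so workspaces running in other VPCs cannot use it.
response = s3control.create_access_point(
    AccountId="123456789012",          # placeholder AWS account ID
    Name="databricks-workspace-ap",    # placeholder access point name
    Bucket="my-s3-bucket",             # placeholder bucket name
    VpcConfiguration={"VpcId": "vpc-0abc1234def567890"},  # the workspace's VPC
)
print(response["AccessPointArn"])
```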
If you need further assistance, feel free to ask! 🚀