a week ago
Trying to load a CSV file from a private S3 bucket.
Please clarify the requirements for doing this. Can I do it in Community Edition (and if so, how)? How do I do it in the premium version?
I have an IAM role, and I also have an access key and secret.
a week ago - last edited a week ago
Assuming you have these prerequisites:
A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)
An IAM user or role with access (list/get) to that bucket
The AWS Access Key ID and Secret Access Key (client and secret)
The most straightforward way to test that the connection works is from a notebook.
Set the keys in the Spark configuration directly:
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<YOUR_AWS_ACCESS_KEY_ID>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<YOUR_AWS_SECRET_ACCESS_KEY>")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")Then read your file into a dataframe:
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)
For non-testing scenarios, you can store the secrets in Databricks secrets or an external key vault and retrieve them whenever you need them (see the sketch below). However, the best option is to use AWS IAM policies to attach the proper role to the cluster (an instance profile), so it can access the data directly without any credentials in code.
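For example, once a secret scope has been created (e.g. via the Databricks CLI), the keys can be pulled at run time instead of being hard-coded. This is only a minimal sketch; the scope name "aws-creds" and the key names "access-key" / "secret-key" are placeholders you would replace with your own:

# Retrieve the credentials from a Databricks secret scope instead of hard-coding them.
# The scope and key names below are placeholders - use the ones you created.
access_key = dbutils.secrets.get(scope="aws-creds", key="access-key")
secret_key = dbutils.secrets.get(scope="aws-creds", key="secret-key")

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)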
Take a look at this: https://docs.databricks.com/aws/en/connect/storage/tutorial-s3-instance-profile
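And once an instance profile is attached to the cluster (as described in the tutorial above), no credentials need to be set in the notebook at all; assuming the cluster's IAM role allows s3:GetObject and s3:ListBucket on the bucket, the read is simply:

# No access keys in the code: the cluster's instance profile (IAM role) provides access.
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)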
Thursday
Thank you