a week ago
Trying to load a CSV file from a private S3 bucket.
Please clarify the requirements for doing this. Can I do it in Community Edition (and if so, how)? How do I do it in the premium version?
I have an IAM role, and I also have an access key and secret.
a week ago - last edited a week ago
Assuming you have these prerequisites:
A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)
An IAM user or role with access (list/get) to that bucket
The AWS Access Key ID and Secret Access Key (client and secret)
The most straightforward way to test that the connection works is from a notebook.
Set the keys in the Spark configuration directly:
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<YOUR_AWS_ACCESS_KEY_ID>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<YOUR_AWS_SECRET_ACCESS_KEY>")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")Then read your file into a dataframe:
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)
For non-testing scenarios, you can store the secrets in Databricks secrets or an external key vault and retrieve them whenever you need them (see the sketch below). However, the best option is to use AWS IAM policies to attach the proper role to the cluster (an instance profile), so it can access the data directly without any credentials in code.
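For example, once a secret scope has been created (e.g. via the Databricks CLI), the keys can be pulled at run time instead of being hard-coded. This is only a minimal sketch; the scope name "aws-creds" and the key names "access-key" / "secret-key" are placeholders you would replace with your own:

# Retrieve the credentials from a Databricks secret scope instead of hard-coding them.
# The scope and key names below are placeholders - use the ones you created.
access_key = dbutils.secrets.get(scope="aws-creds", key="access-key")
secret_key = dbutils.secrets.get(scope="aws-creds", key="secret-key")

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)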
Take a look at this: https://docs.databricks.com/aws/en/connect/storage/tutorial-s3-instance-profile
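And once an instance profile is attached to the cluster (as described in the tutorial above), no credentials need to be set in the notebook at all; assuming the cluster's IAM role allows s3:GetObject and s3:ListBucket on the bucket, the read is simply:

# No access keys in the code: the cluster's instance profile (IAM role) provides access.
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)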
Thursday
Thank you