Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Loading CSV from private S3 bucket

intelliconnectq
New Contributor II

I am trying to load a CSV file from a private S3 bucket.

Could you clarify the requirements for doing this? Can I do it in Community Edition (and if so, how)? And how would I do it in the premium version?

I have an IAM role, and I also have an access key and secret.

 

Accepted Solution

Coffee77
Contributor III

Assuming you have these prerequisites:

  • A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)

  • An IAM user or role with list/get access to that bucket (see the optional check after this list)

  • The AWS Access Key ID and Secret Access Key for that IAM user
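As an optional sanity check before involving Spark, you could verify that the keys can actually list the bucket with boto3 (normally available on Databricks clusters). The bucket name and prefix below are placeholders for your own values:

import boto3

# Placeholder credentials and bucket; replace with your own values
s3 = boto3.client(
    "s3",
    aws_access_key_id="<YOUR_AWS_ACCESS_KEY_ID>",
    aws_secret_access_key="<YOUR_AWS_SECRET_ACCESS_KEY>",
)

# List a few objects to confirm the keys have list access to the bucket
response = s3.list_objects_v2(Bucket="my-private-bucket", Prefix="data/", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])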

The most straightforward way to test and confirm that the connection works is from a notebook:

Set the keys in Spark's Hadoop configuration directly:

# Testing only: avoid hard-coding credentials in notebooks you plan to keep
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<YOUR_AWS_ACCESS_KEY_ID>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<YOUR_AWS_SECRET_ACCESS_KEY>")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")

Then read the file into a DataFrame:

df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)

For anything beyond testing, store the credentials in a Databricks secret scope (or an external key vault) and retrieve them at runtime, as sketched below. The best option, however, is to attach an instance profile (IAM role) to the cluster via AWS IAM policies so it can access the data directly, without any credentials in your code.
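For example, assuming you have already created a secret scope (hypothetically named aws-creds here, with keys access-key and secret-key), you could retrieve the values with dbutils.secrets.get instead of hard-coding them:

# Hypothetical scope and key names; create them first via the Databricks CLI or Secrets API
access_key = dbutils.secrets.get(scope="aws-creds", key="access-key")
secret_key = dbutils.secrets.get(scope="aws-creds", key="secret-key")

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")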

Take a look at this: https://docs.databricks.com/aws/en/connect/storage/tutorial-s3-instance-profile
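Once an instance profile is attached to the cluster (as in that tutorial), the read works without any credential configuration, for example:

# The cluster's instance profile grants S3 access, so no keys are set here
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)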


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData



Thank you
