Assuming you have these prerequisites:
A private S3 bucket (e.g., s3://my-private-bucket/data/file.csv)
An IAM user or role with list/get access to that bucket (a minimal policy sketch follows this list)
The AWS Access Key ID and Secret Access Key for that user
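If that IAM identity does not have a policy yet, here is a minimal sketch of what it usually looks like, written as a Python dict so it stays in the same language as the notebook code below; the bucket name is taken from the example path, so replace it with yours.

import json

# Minimal read-only policy sketch: s3:ListBucket on the bucket, s3:GetObject on its objects.
# "my-private-bucket" is the example bucket name from above; replace it with your own.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-private-bucket"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::my-private-bucket/*"],
        },
    ],
}

print(json.dumps(read_only_policy, indent=2))  # paste the JSON into the IAM policy editor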
The most straightforward way to test that the connection works is from a notebook.
Set the keys on the Spark Hadoop configuration directly:
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<YOUR_AWS_ACCESS_KEY_ID>")
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<YOUR_AWS_SECRET_ACCESS_KEY>")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")
Then read your file into a dataframe:
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)
For non-testing scenarios, store the keys in Databricks secrets or a key vault and retrieve them whenever you need them (see the sketch below). The best option, however, is to attach an AWS IAM instance profile to the cluster so it can access the data directly without you specifying credentials at all.
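A minimal sketch of the secrets approach, assuming a secret scope named s3-scope with keys aws-access-key and aws-secret-key already exists (all three names are placeholders):

# Read the credentials from Databricks secrets so they never appear in the notebook.
access_key = dbutils.secrets.get(scope="s3-scope", key="aws-access-key")
secret_key = dbutils.secrets.get(scope="s3-scope", key="aws-secret-key")

spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)

df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)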
Take a look at this: https://docs.databricks.com/aws/en/connect/storage/tutorial-s3-instance-profile
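Once an instance profile with read access to the bucket is attached to the cluster (as the tutorial above describes), the notebook needs no credential setup at all; the read becomes just:

# No keys configured anywhere: the cluster's instance profile supplies the credentials.
df = spark.read.option("header", "true").csv("s3a://my-private-bucket/data/file.csv")
display(df)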