Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trouble connecting to Amazon S3 using Spark

Yuki
Contributor

I'm having trouble connecting to Amazon S3.
I'm following this document: https://docs.databricks.com/gcp/en/connect/storage/amazon-s3

But I still can't access S3.

I believe the credentials are correct because I have verified that I can access S3 via boto3.
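
For reference, the boto3 check looked roughly like this (the bucket name, object key, and credentials below are placeholders, not the real values):

```python
# Hypothetical sketch of the boto3 check; bucket, key, region, and credentials
# are placeholders, not the real values.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<access-key>",
    aws_secret_access_key="<secret-key>",
    region_name="ap-northeast-1",
)

# HEAD the object, which is the same kind of request the S3A connector makes on read
s3.head_object(Bucket="<source-bucket>", Key="foo.csv")
print("object is reachable with these credentials")
```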

However, I'm using an instance profile to access other S3 buckets.
Could this be the cause?

Thank you.


Isi
Contributor III

Hey @Yuki ,

If you're using instance profiles to access S3, make sure your cluster is running in "Single User" (or Dedicated) access mode. Instance profiles won't work with Shared (or Standard) or No Isolation clusters, especially if you're trying to access S3 from Unity Catalog or within notebooks.

 

You can check this by going to your cluster settings and verifying that it's configured with Access Mode: Single User/Dedicated, and that the correct user is assigned (the one mapped to the instance profile, either directly or via group policies).
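
If it helps, here is a rough sketch of reading those two settings with the Clusters REST API; the workspace host, token, and cluster ID are placeholders:

```python
# Rough sketch: read the cluster's access mode and instance profile
# via the Clusters REST API. Host, token, and cluster ID are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"
cluster_id = "<cluster-id>"

resp = requests.get(
    f"{host}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"cluster_id": cluster_id},
)
resp.raise_for_status()
cluster = resp.json()

print("data_security_mode:", cluster.get("data_security_mode"))  # e.g. SINGLE_USER
print("single_user_name:  ", cluster.get("single_user_name"))
print("instance_profile:  ", cluster.get("aws_attributes", {}).get("instance_profile_arn"))
```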

If this doesn't solve your problem, please post the cluster configuration, the instance profile JSON, and the error message 🙂

Hope this helps 🙂

Isi

Yuki
Contributor

Hi Isi,

Thank you for your response, I really appreciate it 😀

Apologies, I didn't explain my concern clearly.

What I'm trying to confirm is whether the instance profile overrides the spark.conf settings defined in a notebook.

For example, I want to read a CSV file on S3 using the following code:

```python

# Global S3A settings (apply to all buckets)
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
spark.conf.set('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
spark.conf.set('spark.hadoop.fs.s3a.server-side-encryption-algorithm', 'SSE-KMS')

# Set credentials using Databricks secrets (after SparkSession is created)
spark.conf.set(f'spark.hadoop.fs.s3a.bucket.{source_bucket}.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider')
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.endpoint", "s3.ap-northeast-1.amazonaws.com")
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.access.key", source_access_key)
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.secret.key", source_secret_key)
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.region", source_region)
 
df = spark.read.option("header", True).csv(source_path)

```

I can access S3 via boto3, but I can't access it from Spark.
The error message is below.
`: java.nio.file.AccessDeniedException: s3a://<source_bucket>/foo.csv: getFileStatus on s3a://<source_bucket>/foo.csv: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD`

I suspect the issue is caused by the instance profile overriding the credentials. I apologize if my hypothesis caused any misunderstanding of the current status.
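
As a rough way to check that, I'm planning to print the Hadoop configuration the S3A connector actually starts from (sketch only; it uses the internal `_jsc` handle, and the bucket name is a placeholder):

```python
# Sketch: print the base Hadoop configuration the S3A connector starts from.
# spark.conf.set("spark.hadoop.*", ...) calls made at runtime may not be
# reflected here, so this shows which credentials provider actually applies.
hconf = spark.sparkContext._jsc.hadoopConfiguration()

bucket = "<source-bucket>"  # placeholder
for key in (
    "fs.s3a.aws.credentials.provider",
    f"fs.s3a.bucket.{bucket}.aws.credentials.provider",
    f"fs.s3a.bucket.{bucket}.endpoint",
):
    print(key, "=", hconf.get(key))
```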

Finally, my cluster is in Dedicated mode now; thank you again for your advice.

Thank you.
