<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Trouble connecting to Amazon S3 using Spark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122194#M46691</link>
    <description>&lt;P&gt;I can't connect to Amazon S3 from Spark.&lt;BR /&gt;I'm following this document:&amp;nbsp;&lt;A href="https://docs.databricks.com/gcp/en/connect/storage/amazon-s3" target="_blank"&gt;https://docs.databricks.com/gcp/en/connect/storage/amazon-s3&lt;/A&gt;&lt;/P&gt;&lt;P&gt;But I still can't access the S3 bucket.&lt;/P&gt;&lt;P&gt;I believe the credentials are correct because I have verified that I can access S3 via boto3.&lt;/P&gt;&lt;P&gt;However, I'm using an instance profile to access other S3 buckets.&amp;nbsp;&lt;BR /&gt;Could this be the cause?&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
    <pubDate>Wed, 18 Jun 2025 23:28:12 GMT</pubDate>
    <dc:creator>Yuki</dc:creator>
    <dc:date>2025-06-18T23:28:12Z</dc:date>
    <item>
      <title>Trouble connecting to Amazon S3 using Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122194#M46691</link>
      <description>&lt;P&gt;I can't connect to Amazon S3 from Spark.&lt;BR /&gt;I'm following this document:&amp;nbsp;&lt;A href="https://docs.databricks.com/gcp/en/connect/storage/amazon-s3" target="_blank"&gt;https://docs.databricks.com/gcp/en/connect/storage/amazon-s3&lt;/A&gt;&lt;/P&gt;&lt;P&gt;But I still can't access the S3 bucket.&lt;/P&gt;&lt;P&gt;I believe the credentials are correct because I have verified that I can access S3 via boto3.&lt;/P&gt;&lt;P&gt;However, I'm using an instance profile to access other S3 buckets.&amp;nbsp;&lt;BR /&gt;Could this be the cause?&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 23:28:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122194#M46691</guid>
      <dc:creator>Yuki</dc:creator>
      <dc:date>2025-06-18T23:28:12Z</dc:date>
    </item>
    <item>
      <title>Re: Trouble connecting to Amazon S3 using Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122462#M46781</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/93088"&gt;@Yuki&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P class=""&gt;If you’re using &lt;SPAN class=""&gt;&lt;STRONG&gt;instance profiles&lt;/STRONG&gt;&lt;/SPAN&gt; to access S3, make sure your &lt;SPAN class=""&gt;&lt;STRONG&gt;cluster is running in “Single User” (or Dedicated) access mode&lt;/STRONG&gt;&lt;/SPAN&gt;. Instance profiles &lt;SPAN class=""&gt;&lt;STRONG&gt;won’t work with Shared (or Standard) or No Isolation clusters&lt;/STRONG&gt;&lt;/SPAN&gt;, especially if you’re trying to access S3 from Unity Catalog or within notebooks.&lt;/P&gt;&lt;P class=""&gt;&amp;nbsp;&lt;/P&gt;&lt;P class=""&gt;You can check this by going to your cluster settings and verifying that its Access Mode is set to&amp;nbsp;&lt;SPAN class=""&gt;&lt;STRONG&gt;Single User/Dedicated&lt;/STRONG&gt;&lt;/SPAN&gt;, and that the correct user is assigned (the one mapped to the instance profile, either directly or via group policies).&lt;BR /&gt;&lt;BR /&gt;If this doesn't solve your problem, please post the cluster configuration, the instance profile JSON, and the error &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;P class=""&gt;Hope this helps &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Sun, 22 Jun 2025 11:50:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122462#M46781</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-06-22T11:50:49Z</dc:date>
    </item>
    <item>
      <title>Re: Trouble connecting to Amazon S3 using Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122478#M46788</link>
      <description>&lt;P&gt;Hi Isi,&lt;/P&gt;&lt;P&gt;Thank you for your response — I really appreciate it &lt;span class="lia-unicode-emoji" title=":grinning_face:"&gt;😀&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Apologies, I didn’t explain my concern clearly.&lt;/P&gt;&lt;P&gt;What I’m trying to confirm is whether the instance profile overrides the spark.conf settings defined in a notebook.&lt;/P&gt;&lt;P&gt;For example, I want to access a CSV on S3 using the following code:&lt;/P&gt;&lt;PRE&gt;# global level
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
spark.conf.set("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark.conf.set("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "SSE-KMS")

# Set per-bucket credentials using Databricks secrets (after the SparkSession is created)
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.endpoint", "s3.ap-northeast-1.amazonaws.com")
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.access.key", source_access_key)
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.secret.key", source_secret_key)
spark.conf.set(f"spark.hadoop.fs.s3a.bucket.{source_bucket}.region", source_region)

df = spark.read.option("header", True).csv(source_path)&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;I can access S3 via boto3, but I can't access it from Spark.&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;The error message is:&lt;BR /&gt;`: java.nio.file.AccessDeniedException: s3a://&amp;lt;source_bucket&amp;gt;/foo.csv: getFileStatus on s3a://&amp;lt;source_bucket&amp;gt;/foo.csv: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD`&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I suspect the issue is caused by the instance profile overwriting the credentials. I apologize if my hypothesis caused any misunderstanding of the current status.&lt;/P&gt;&lt;P&gt;Finally, my cluster is in Dedicated mode now; thank you again for your advice.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Sun, 22 Jun 2025 23:31:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/it-s-not-going-well-to-connect-to-amazon-s3-with-using-spark/m-p/122478#M46788</guid>
      <dc:creator>Yuki</dc:creator>
      <dc:date>2025-06-22T23:31:44Z</dc:date>
    </item>
  </channel>
</rss>

