
Unable to access AWS S3 - Error : java.nio.file.AccessDeniedException

Madhawa
New Contributor II

Reading a table like this:

data = spark.sql("SELECT * FROM edge.inv.rm")

Getting this error 

org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 441.0 failed 4 times, most recent failure: Lost task 10.3 in stage 441.0 (TID 204) (XX.XX.X.XX executor 0): com.databricks.sql.io.FileReadException: Error while reading file s3://edge-dataproducts-s3/inv/RM/part-002-c7e-12-8d9-fa1.c0.snappy.parquet.                                                                                                                      

Caused by: java.nio.file.AccessDeniedException: s3a://edge-dataproducts-s3/inv/RM/part-002-c7e-12-8d9-fa1.c0.snappy.parquet   

 

I have tried different cluster types and Spark runtime versions, but I still get the same error.

Any suggestions for resolving this error?

1 ACCEPTED SOLUTION

Kaniz_Fatma
Community Manager

Hi @Madhawa,

  • Ensure that the AWS credentials (access key and secret key) are correctly configured in your Spark application. You can set them using spark.conf.set("spark.hadoop.fs.s3a.access.key", "your_access_key") and spark.conf.set("spark.hadoop.fs.s3a.secret.key", "your_secret_key") (see the sketch after this list).
  • Verify that the IAM user associated with these credentials has the necessary permissions to read from the S3 bucket.
  • Set the S3 endpoint correctly using spark.conf.set("spark.hadoop.fs.s3a.endpoint", "your_s3_endpoint"). Replace "your_s3_endpoint" with the actual endpoint for your S3 region (e.g., "s3.amazonaws.com").
  • Make sure the region matches the S3 bucket’s region.
  • Double-check the file path: s3a://edge-dataproducts-s3/inv/RM/part-002-c7e-12-8d9-fa1.c0.snappy.parquet. Ensure that the file exists in the specified location.
  • Verify that the bucket name, folder structure, and file name are accurate.
  • If you’re using an EMR cluster, ensure that the cluster’s security group allows outbound traffic to S3.
  • Check if there are any network-related issues (firewalls, VPC settings, etc.).
  • Consider upgrading Spark to the latest version (if possible) to benefit from bug fixes and improvements related to S3 interactions.
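
Putting the credential, endpoint, and path checks together, here is a minimal sketch of how this could look in a Databricks notebook. The access key, secret key, and endpoint values are placeholders (assumptions, not values from this thread), and dbutils.fs.ls is used only to confirm the path is reachable:

# Minimal sketch -- the credential and endpoint values below are
# placeholders; substitute the real values for your account and the
# endpoint matching your bucket's region.
spark.conf.set("spark.hadoop.fs.s3a.access.key", "your_access_key")
spark.conf.set("spark.hadoop.fs.s3a.secret.key", "your_secret_key")
spark.conf.set("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")

# Confirm the objects exist and are reachable before suspecting the file
# itself (dbutils and display are available in Databricks notebooks).
display(dbutils.fs.ls("s3a://edge-dataproducts-s3/inv/RM/"))

# Retry the read directly against the prefix from the error message.
df = spark.read.parquet("s3a://edge-dataproducts-s3/inv/RM/")
df.limit(10).show()

If the dbutils.fs.ls call itself fails with an AccessDeniedException, the problem lies with the credentials or bucket policy rather than with any individual file.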

If you’ve tried all these steps and still face issues, please provide additional details, and we’ll continue troubleshooting! 

For more information, you can refer to this Stack Overflow thread.


2 REPLIES


Madhawa
New Contributor II

My concern is that sometimes I am able to run it without any errors, and other times I get this error. Why is that?
