sparklyr::spark_read_csv forbidden 403 error

thethirtyfour
New Contributor III

Hi,

I am trying to read a CSV file into a Spark DataFrame using sparklyr::spark_read_csv, but I am getting a 403 Access Denied error.

I have stored my AWS credentials as environment variables and can successfully read the file into an R data frame using arrow::read_csv_arrow. However, spark_read_csv fails.

I have confirmed that I am connected to Spark and can read Parquet files stored elsewhere.

Any advice?

Thanks,

my_file <- glue::glue("s3://my-bucket/my-folder/my-file-name.csv")

## This works
mydata <- arrow::read_csv_arrow(
  file = my_file
)

## This doesn't
mydata <- sparklyr::spark_read_csv(
  sc,
  name = "mydata",
  path = my_file
)

# Error message
Error : java.nio.file.AccessDeniedException

Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request

1 REPLY

Kaniz
Community Manager

Hi @thethirtyfour, it seems you're encountering a 403 Forbidden error when trying to read a CSV file into a Spark DataFrame using sparklyr::spark_read_csv.

Let's troubleshoot this issue and explore potential solutions:

  1. IAM Roles vs. AWS Keys: Credentials stored as environment variables inside the R session are visible to arrow, but not to the Spark JVM, which is why arrow::read_csv_arrow succeeds while Spark's S3 connector is denied. Prefer an instance profile (IAM role) attached to the cluster, or pass the keys through to Spark explicitly (see the sketch after this list).

  2. Check Permissions:

    • Verify that the user running the notebook or script has the necessary permissions to access the S3 bucket.
    • If you're using Azure Synapse Analytics, consider adding the RBAC Storage Blob Data Contributor role to the user. You can do this after workspace creation.

  3. File Permissions: Confirm that the identity the cluster runs as is allowed s3:GetObject on the object and s3:ListBucket on the bucket, not only on the Parquet locations you can already read.

  4. Cache Invalidation: If keys or bucket policies were rotated recently, restart the cluster or reconnect the Spark session so the connector picks up the new values instead of cached credentials.

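For point 1, here is a minimal sketch of handing the keys to Spark's S3A connector from sparklyr. It assumes the s3:// path resolves to S3A (as the AmazonS3Exception in your stack trace suggests) and reuses the AWS_* environment variables from your post:

## Option 1: supply the keys at connect time via the Spark config.
## Note: on Databricks, spark_connect(method = "databricks") attaches
## to an already-running session, so connect-time config may not take
## effect there; use Option 2 instead.
conf <- sparklyr::spark_config()
conf$spark.hadoop.fs.s3a.access.key <- Sys.getenv("AWS_ACCESS_KEY_ID")
conf$spark.hadoop.fs.s3a.secret.key <- Sys.getenv("AWS_SECRET_ACCESS_KEY")
sc <- sparklyr::spark_connect(method = "databricks", config = conf)

## Option 2: set the keys on the live connection through the
## Hadoop configuration object.
hconf <- sparklyr::invoke(sparklyr::spark_context(sc), "hadoopConfiguration")
sparklyr::invoke(hconf, "set", "fs.s3a.access.key", Sys.getenv("AWS_ACCESS_KEY_ID"))
sparklyr::invoke(hconf, "set", "fs.s3a.secret.key", Sys.getenv("AWS_SECRET_ACCESS_KEY"))

## The original read should then authenticate:
mydata <- sparklyr::spark_read_csv(
  sc,
  name = "mydata",
  path = "s3://my-bucket/my-folder/my-file-name.csv"
)

With an instance profile attached to the cluster, none of this is needed and no keys appear in code, which is generally the safer route.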
Hopefully, one of these steps will help resolve the access denied issue. If you continue to encounter problems, feel free to ask for further assistance! 🚀