How to connect to an on-premises implementation of S3 storage (such as MinIO) from Databricks notebooks

pg289
New Contributor II

I manage a large data lake of Iceberg tables stored on-premises in MinIO S3 storage, and I need a Spark cluster to run ETL jobs. I decided to try Databricks as there were no other good options. However, I'm unable to access my tables or even raw files: Databricks assumes the bucket lives in AWS and connects using AWS-style virtual-hosted paths. I explicitly configured S3A with path-style access, but it still fails to fetch the file. Code snippet and error are pasted below. Any input on what I'm missing? Has anyone connected Databricks to non-AWS S3 storage before, or is it not possible at all?

BUCKET = "s3a://test/file1"
conf.setAll([
    ("spark.hadoop.fs.s3a.endpoint", AWS_ENDPOINT),
    ("spark.hadoop.fs.s3a.path.style.access", "true"),
    ("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY),
    ("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_KEY),
    ("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
    ("spark.hadoop.fs.s3a.impl.disable.cache", "false"),
    ("spark.hadoop.fs.s3a.connection.ssl.enabled", "true"),
    ("spark.hadoop.hadoop.rpc.protection", "privacy")
])

File , line 3
----> 3 df = spark.read.parquet(s3_path)

Py4JJavaError: An error occurred while calling o458.parquet.
: java.nio.file.AccessDeniedException: s3a://test/file1/fp.parquet: getFileStatus on s3a://test/file1/fp.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://test.s3.us-west-2.amazonaws.com file1/fp.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.638 Linux/5.15.0-1078-azure OpenJDK_64-Bit_Server_VM/17.0.11+9-LTS java/17.0.11 scala/2.12.15 kotlin/1.9.10 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectMetadataRequest; credentials-provider: com.amazonaws.auth.AnonymousAWSCredentials credential-header: no-credential-header signature-present: false (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD
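
A quick way to confirm whether those settings reached the S3A layer at all is to read them back from the running session's Hadoop configuration (a sketch, using the spark session the notebook provides):

# Sketch: print the effective S3A settings from the live Hadoop configuration.
# If these come back as None / AWS defaults, the conf.setAll() above never
# took effect, because the SparkSession already existed when it ran.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3a.endpoint", "fs.s3a.path.style.access", "fs.s3a.access.key"):
    print(key, "=", hconf.get(key))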

 

1 REPLY

SP_6721
New Contributor

Databricks defaults to AWS-style (virtual-hosted) requests when the S3A configuration is incomplete, and your stack trace suggests exactly that: the HEAD request went to test.s3.us-west-2.amazonaws.com with AnonymousAWSCredentials, so your endpoint and keys were never applied. Set spark.hadoop.fs.s3a.endpoint to your MinIO server's URL, and if MinIO is served over plain HTTP, disable SSL by setting spark.hadoop.fs.s3a.connection.ssl.enabled to false.
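
One caveat: in a Databricks notebook the SparkSession already exists before your code runs, so building a fresh SparkConf and calling setAll() on it has no effect. A minimal sketch of applying the same settings to the live Hadoop configuration instead (the MINIO_* names are placeholders for your values; alternatively, put the same spark.hadoop.fs.s3a.* keys in the cluster's Spark config so they apply at startup):

# Sketch: apply S3A settings to the running session's Hadoop configuration.
# Note the keys drop the "spark.hadoop." prefix at this level.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", MINIO_ENDPOINT)        # e.g. "https://minio.example.internal:9000"
hconf.set("fs.s3a.path.style.access", "true")       # MinIO needs path-style requests
hconf.set("fs.s3a.access.key", MINIO_ACCESS_KEY)
hconf.set("fs.s3a.secret.key", MINIO_SECRET_KEY)
hconf.set("fs.s3a.connection.ssl.enabled", "true")  # "false" if MinIO is plain HTTP
hconf.set("fs.s3a.impl.disable.cache", "true")      # avoid a cached client with stale settings

df = spark.read.parquet("s3a://test/file1/fp.parquet")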

Since the error is 403 Forbidden, also check the bucket permissions on the MinIO side, verify the endpoint settings, and make sure no conflicting AWS configuration is overriding your settings.
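
To rule Spark out entirely, it can help to hit the same object with a plain S3 client first. A minimal sketch using boto3 (the MINIO_* names are placeholders for your endpoint and keys):

import boto3

# Sketch: verify the endpoint and credentials independently of Spark.
# If this also fails with a 403, the problem is on the MinIO side
# (bucket policy, wrong keys), not in the Spark/S3A configuration.
s3 = boto3.client(
    "s3",
    endpoint_url=MINIO_ENDPOINT,             # e.g. "https://minio.example.internal:9000"
    aws_access_key_id=MINIO_ACCESS_KEY,
    aws_secret_access_key=MINIO_SECRET_KEY,
)
print(s3.head_object(Bucket="test", Key="file1/fp.parquet"))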
