How to connect to an on-premises implementation of S3 storage (such as MinIO) from Databricks notebooks

pg289
New Contributor II

I manage a large data lake of Iceberg tables stored on-premises in MinIO S3 storage, and I need a Spark cluster to run ETL jobs. I decided to try Databricks as there were no other good options. However, I'm unable to access my tables or even raw files: Databricks assumes the bucket lives in AWS and connects using AWS-style virtual-hosted paths. I explicitly configured S3A with path-style access, but it still fails to fetch the file. Code snippet and error are pasted below. Any input on what I'm missing? Has anyone connected Databricks to non-AWS S3 storage before, or is it not possible at all?

BUCKET = "s3a://test/file1"
conf.setAll([
    ("spark.hadoop.fs.s3a.endpoint", AWS_ENDPOINT),
    ("spark.hadoop.fs.s3a.path.style.access", "true"),
    ("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY),
    ("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_KEY),
    ("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
    ("spark.hadoop.fs.s3a.impl.disable.cache", "false"),
    ("spark.hadoop.fs.s3a.connection.ssl.enabled", "true"),
    ("spark.hadoop.hadoop.rpc.protection", "privacy")
])

File , line 3
----> 3 df = spark.read.parquet(s3_path)

Py4JJavaError: An error occurred while calling o458.parquet.
: java.nio.file.AccessDeniedException: s3a://test/file1/fp.parquet: getFileStatus on s3a://test/file1/fp.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://test.s3.us-west-2.amazonaws.com file1/fp.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.638 Linux/5.15.0-1078-azure OpenJDK_64-Bit_Server_VM/17.0.11+9-LTS java/17.0.11 scala/2.12.15 kotlin/1.9.10 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectMetadataRequest; credentials-provider: com.amazonaws.auth.AnonymousAWSCredentials credential-header: no-credential-header signature-present: false (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD
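
A quick way to confirm whether those settings reached the S3A layer at all is to read them back from the running session's Hadoop configuration (a sketch, using the spark session the notebook provides):

# Sketch: print the effective S3A settings from the live Hadoop configuration.
# If these come back as None / AWS defaults, the conf.setAll() above never
# took effect, because the SparkSession already existed when it ran.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
for key in ("fs.s3a.endpoint", "fs.s3a.path.style.access", "fs.s3a.access.key"):
    print(key, "=", hconf.get(key))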

 

1 REPLY

SP_6721
New Contributor

Databricks defaults to AWS-style (virtual-hosted) requests when the S3A configuration is incomplete, and your stack trace suggests exactly that: the HEAD request went to test.s3.us-west-2.amazonaws.com with AnonymousAWSCredentials, so your endpoint and keys were never applied. Set spark.hadoop.fs.s3a.endpoint to your MinIO server's URL, and if MinIO is served over plain HTTP, disable SSL by setting spark.hadoop.fs.s3a.connection.ssl.enabled to false.
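
One caveat: in a Databricks notebook the SparkSession already exists before your code runs, so building a fresh SparkConf and calling setAll() on it has no effect. A minimal sketch of applying the same settings to the live Hadoop configuration instead (the MINIO_* names are placeholders for your values; alternatively, put the same spark.hadoop.fs.s3a.* keys in the cluster's Spark config so they apply at startup):

# Sketch: apply S3A settings to the running session's Hadoop configuration.
# Note the keys drop the "spark.hadoop." prefix at this level.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.endpoint", MINIO_ENDPOINT)        # e.g. "https://minio.example.internal:9000"
hconf.set("fs.s3a.path.style.access", "true")       # MinIO needs path-style requests
hconf.set("fs.s3a.access.key", MINIO_ACCESS_KEY)
hconf.set("fs.s3a.secret.key", MINIO_SECRET_KEY)
hconf.set("fs.s3a.connection.ssl.enabled", "true")  # "false" if MinIO is plain HTTP
hconf.set("fs.s3a.impl.disable.cache", "true")      # avoid a cached client with stale settings

df = spark.read.parquet("s3a://test/file1/fp.parquet")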

Since the error is 403 Forbidden, also check the bucket permissions on the MinIO side, verify the endpoint settings, and make sure no conflicting AWS configuration is overriding your settings.
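
To rule Spark out entirely, it can help to hit the same object with a plain S3 client first. A minimal sketch using boto3 (the MINIO_* names are placeholders for your endpoint and keys):

import boto3

# Sketch: verify the endpoint and credentials independently of Spark.
# If this also fails with a 403, the problem is on the MinIO side
# (bucket policy, wrong keys), not in the Spark/S3A configuration.
s3 = boto3.client(
    "s3",
    endpoint_url=MINIO_ENDPOINT,             # e.g. "https://minio.example.internal:9000"
    aws_access_key_id=MINIO_ACCESS_KEY,
    aws_secret_access_key=MINIO_SECRET_KEY,
)
print(s3.head_object(Bucket="test", Key="file1/fp.parquet"))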
