I manage a large data lake of Iceberg tables stored on-premise in S3-compatible storage (MinIO), and I need a Spark cluster to run ETL jobs. I decided to try Databricks as there were no other good options. However, I'm unable to access my tables or even raw files: Databricks assumes the bucket lives on AWS and connects accordingly, so the failed request below goes to test.s3.us-west-2.amazonaws.com with anonymous credentials instead of to my MinIO endpoint with the keys I supplied. I explicitly configured the s3a filesystem with my endpoint, credentials, and path-style access, but the settings don't seem to be picked up and the file can't be fetched. I've pasted the code snippet and the error below. Any input on what I'm missing? Has anyone connected Databricks to non-AWS S3 storage before, or is it not possible at all?
BUCKET = "s3a://test/file1"
conf.setAll([
("spark.hadoop.fs.s3a.endpoint", AWS_ENDPOINT),
("spark.hadoop.fs.s3a.path.style.access", "true"),
("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY),
("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_KEY),
("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"),
("spark.hadoop.fs.s3a.impl.disable.cache", "false"),
("spark.hadoop.fs.s3a.connection.ssl.enabled", "true"),
("spark.hadoop.hadoop.rpc.protection", "privacy")
])
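For completeness, this is roughly how the snippet above is wired together in my notebook. It's a sketch rather than a verbatim copy: the imports, the placeholder values for AWS_ENDPOINT / AWS_ACCESS_KEY / AWS_SECRET_KEY, and the session-building step are filled in here for context, and s3_path is reconstructed from the traceback further down.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Placeholders -- the real values point at our on-prem MinIO deployment.
AWS_ENDPOINT = "https://minio.internal.example.com:9000"
AWS_ACCESS_KEY = "<minio-access-key>"
AWS_SECRET_KEY = "<minio-secret-key>"

BUCKET = "s3a://test/file1"

conf = SparkConf()
conf.setAll([
    ("spark.hadoop.fs.s3a.endpoint", AWS_ENDPOINT),
    ("spark.hadoop.fs.s3a.access.key", AWS_ACCESS_KEY),
    ("spark.hadoop.fs.s3a.secret.key", AWS_SECRET_KEY),
    # ... plus the path-style / impl / cache / ssl / rpc settings listed above ...
])

# On Databricks, getOrCreate() attaches to the already-running session.
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# The read that fails; per the traceback, s3_path resolves to s3a://test/file1/fp.parquet.
s3_path = f"{BUCKET}/fp.parquet"
df = spark.read.parquet(s3_path)

That read immediately produces the traceback below.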
File , line 3
----> 3 df = spark.read.parquet(s3_path)

Py4JJavaError: An error occurred while calling o458.parquet.
: java.nio.file.AccessDeniedException: s3a://test/file1/fp.parquet: getFileStatus on s3a://test/file1/fp.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://test.s3.us-west-2.amazonaws.com file1/fp.parquet {} Hadoop 3.3.6, aws-sdk-java/1.12.638 Linux/5.15.0-1078-azure OpenJDK_64-Bit_Server_VM/17.0.11+9-LTS java/17.0.11 scala/2.12.15 kotlin/1.9.10 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectMetadataRequest; credentials-provider: com.amazonaws.auth.AnonymousAWSCredentials credential-header: no-credential-header signature-present: false (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden;;
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD