I can read all of my S3 data without any issues after configuring my cluster with an instance profile. However, when I try to run the following DLT decorator, it gives me an access denied error. Are there some other IAM tweaks I need to make for Delta? Looking at the pipeline, it appears to fail when setting up tables in S3 after the initial read.

Note that I also tried setting my storage location to a path in S3, with both the s3a:// and /mnt syntax, with no luck either. And if I set storage to my bucket, it hangs on waiting for resources before failing with `DataPlaneException: Failed to start the DLT service on cluster`.

Ultimately I want to use this with Auto Loader and cloudFiles (see the sketch at the end of this post), but this is a simplified test that should work anyway -- thanks!
# This gives me a 403 java.nio.file.AccessDeniedException on the S3 location
import dlt
from pyspark.sql.functions import explode, col

@dlt.table
def rtb_dlt_bids_bronze():
    return (
        spark.read.format("json")
            .option("multiLine", "true")
            .option("inferSchema", "true")
            .load("/mnt/demo/<pathtofile>"))
On the other hand, this works fine:
display(spark.read.format("json")
    .option("multiLine", "true")
    .option("inferSchema", "true")
    .load("/mnt/demo/<pathtofile>"))
Here's the full error from the failed DLT run:

raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o772.load.
: java.nio.file.AccessDeniedException: s3a://<pathtofile>: getFileStatus on s3a://<pathtofile>: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden; request: HEAD https://<pathtofile>; {} Hadoop 3.3.1, aws-sdk-java/1.12.189 Linux/5.4.0-1075-aws OpenJDK_64-Bit_Server_VM/25.302-b08 java/1.8.0_302 scala/2.12.14 vendor/Azul_Systems,_Inc. cfg/retry-mode/legacy com.amazonaws.services.s3.model.GetObjectMetadataRequest
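For completeness, here's a rough sketch of the Auto Loader version I'm ultimately aiming for -- the table name is just for illustration, and the path is the same placeholder as above:

import dlt

# Sketch only: incrementally ingest new JSON files from the mounted
# S3 path with Auto Loader (cloudFiles). I haven't been able to test
# this past the same access error described above.
@dlt.table
def rtb_dlt_bids_bronze_autoloader():
    return (
        spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("multiLine", "true")
            .load("/mnt/demo/<pathtofile>"))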