Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Preferred way to read S3 - dbutils, Boto3, or a better solution?

Neli
New Contributor III

We have a use case where a table has 15K rows, and one of the columns holds an S3 location. For each row we need to fetch the S3 location from that column and read the object's content from S3. Reading the content from S3 makes the workflow take a long time; we tried a 96 GB cluster. We tried both Boto3 and dbutils.fs.head, and both take around 30 minutes. Is there a better suggestion/solution available?
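One reason a per-row loop with Boto3 or dbutils.fs.head is slow is that each network round-trip runs sequentially on the driver. If a driver-side loop is kept, the reads can at least be overlapped with a thread pool, since fetching many small objects is I/O-bound. A minimal sketch (the `fetch_one` wrapper shown in the docstring is hypothetical, not from this thread):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(paths, fetch_one, max_workers=64):
    """Fetch many small S3 objects concurrently.

    `fetch_one` is any callable that reads a single path, e.g. a thin
    wrapper around boto3's get_object (illustrative assumption):
        def fetch_one(path):
            bucket, key = path.replace("s3://", "").split("/", 1)
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Threads overlap the network round-trips instead of
        # paying for all 15K of them one after another.
        return list(pool.map(fetch_one, paths))
```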

1 ACCEPTED SOLUTION

Accepted Solutions

Kannathasan
New Contributor III

Create an IAM role in AWS for S3 access, store its credentials in a Databricks secret scope, and use them to connect to S3 from Databricks with the code below:

AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}

aws_bucket_name = "my-s3-bucket"

df = spark.read.load(f"s3a://{aws_bucket_name}/s3path/")
display(df)
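For the 15K-row case in the question specifically, Spark's readers also accept a list of paths, so another option is to collect the location column once and let the executors fetch the objects in parallel instead of looping on the driver. A sketch, where `to_s3a_uris` and the `s3_location` column/table names are illustrative assumptions:

```python
def to_s3a_uris(bucket, keys):
    # Normalise table values like "/folder/file.txt" into full s3a:// URIs.
    return [f"s3a://{bucket}/{key.lstrip('/')}" for key in keys]

# Hypothetical usage on a cluster (table and column names are assumptions):
# keys = [r.s3_location for r in spark.table("t").select("s3_location").collect()]
# contents = spark.read.text(to_s3a_uris("my-s3-bucket", keys),
#                            wholetext=True)  # one row per object, read by executors
```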


