Preferred way to read S3 - dbutils or Boto3 or better solution ?

Neli — Thu, 01 Aug 2024 01:48:56 GMT

We have a use case where a table has 15K rows and one of its columns holds an S3 location. For each row, we need to take the S3 location from that column and read the object's content from S3. Reading the content from S3 makes the workflow very slow, even on a 96 GB cluster. We tried both Boto3 and dbutils.fs.head, and both take around 30 minutes. Is there a better approach?

Re: Preferred way to read S3 - dbutils or Boto3 or better solution ?

Kannathasan — Thu, 01 Aug 2024 08:18:40 GMT

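Create an IAM role in AWS with access to the S3 bucket and use its credentials to connect from Databricks with the code below. The first two lines are cluster environment variables that reference keys stored in a Databricks secret scope: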

AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}

aws_bucket_name = "my-s3-bucket"

# Interpolate the bucket name into the path; "s3path" is a placeholder for your prefix
df = spark.read.load(f"s3a://{aws_bucket_name}/s3path/")
display(df)
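
For the row-by-row workflow described in the question, one option that may help is to issue the S3 reads concurrently instead of one at a time. This is a different technique from spark.read.load above, and only a minimal sketch under assumptions: the helper names (split_s3_uri, make_boto3_reader, read_all), the worker count, and the table and column names in the usage note are all hypothetical, and the actual speedup depends on object sizes and network limits.

```python
from concurrent.futures import ThreadPoolExecutor


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    without_scheme = uri.split("://", 1)[1]
    bucket, _, key = without_scheme.partition("/")
    return bucket, key


def make_boto3_reader(max_connections=16):
    """Return a function that reads one S3 object as bytes.

    Assumes boto3 is installed and AWS credentials are already configured.
    """
    import boto3  # imported lazily so the rest of the sketch runs without it
    from botocore.config import Config

    # One shared client; the connection pool is sized to match the thread count.
    s3 = boto3.client("s3", config=Config(max_pool_connections=max_connections))

    def read_one(uri):
        bucket, key = split_s3_uri(uri)
        return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    return read_one


def read_all(uris, read_one, max_workers=16):
    """Read many objects concurrently and return {uri: content}."""
    uris = list(uris)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URI with its content.
        return dict(zip(uris, pool.map(read_one, uris)))
```

On a cluster, usage could look like this (my_table and s3_location are placeholder names):

    uris = [row["s3_location"] for row in spark.table("my_table").select("s3_location").collect()]
    contents = read_all(uris, make_boto3_reader(max_connections=16), max_workers=16)

Collecting 15K paths to the driver is cheap; the time goes into the reads themselves, which is why overlapping them with threads is worth trying before anything heavier.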