Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Preferred way to read S3 - dbutils, Boto3, or a better solution?

Neli
New Contributor III

We have a use case where a table has 15K rows, and one of the columns holds an S3 location. For each row we need to fetch the S3 location from that column and read the object's content from S3. Reading the content from S3 makes the workflow take a long time; we tried a 96 GB cluster. We tried both Boto3 and dbutils.fs.head, and both take around 30 minutes. Is there a better suggestion/solution available?
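One reason a per-row loop with Boto3 or dbutils.fs.head is slow is that each network round-trip runs sequentially on the driver. If a driver-side loop is kept, the reads can at least be overlapped with a thread pool, since fetching many small objects is I/O-bound. A minimal sketch (the `fetch_one` wrapper shown in the docstring is hypothetical, not from this thread):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(paths, fetch_one, max_workers=64):
    """Fetch many small S3 objects concurrently.

    `fetch_one` is any callable that reads a single path, e.g. a thin
    wrapper around boto3's get_object (illustrative assumption):
        def fetch_one(path):
            bucket, key = path.replace("s3://", "").split("/", 1)
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Threads overlap the network round-trips instead of
        # paying for all 15K of them one after another.
        return list(pool.map(fetch_one, paths))
```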

1 ACCEPTED SOLUTION

Accepted Solutions

Kannathasan
New Contributor III

Create an IAM role in AWS for S3 access, store its credentials in a Databricks secret scope, and use them to connect to S3 from Databricks with the code below:

AWS_SECRET_ACCESS_KEY={{secrets/scope/aws_secret_access_key}}
AWS_ACCESS_KEY_ID={{secrets/scope/aws_access_key_id}}

aws_bucket_name = "my-s3-bucket"

df = spark.read.load(f"s3a://{aws_bucket_name}/s3path/")
display(df)
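For the 15K-row case in the question specifically, Spark's readers also accept a list of paths, so another option is to collect the location column once and let the executors fetch the objects in parallel instead of looping on the driver. A sketch, where `to_s3a_uris` and the `s3_location` column/table names are illustrative assumptions:

```python
def to_s3a_uris(bucket, keys):
    # Normalise table values like "/folder/file.txt" into full s3a:// URIs.
    return [f"s3a://{bucket}/{key.lstrip('/')}" for key in keys]

# Hypothetical usage on a cluster (table and column names are assumptions):
# keys = [r.s3_location for r in spark.table("t").select("s3_location").collect()]
# contents = spark.read.text(to_s3a_uris("my-s3-bucket", keys),
#                            wholetext=True)  # one row per object, read by executors
```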


