Data Engineering

Reading from an S3 bucket using boto3 on serverless cluster

petitregny
New Contributor II

Hello All,

I am trying to read a CSV file from my S3 bucket in a notebook running on serverless.

I am using the two standard functions below, but I get a credentials error (Error reading CSV from S3: Unable to locate credentials).

I don't have this issue when running exactly the same code on a personal compute, which has the appropriate AWS access role attached. Using spark.read.csv() also works on serverless, but I would like to be able to use boto3 with serverless.

Is there a way to get this to work?

Thank you!

import boto3
import pandas as pd

def create_s3_client(key_id, access_key, region):
    return boto3.client(
        's3',
        aws_access_key_id=key_id,
        aws_secret_access_key=access_key,
        region_name=region
    )

def read_csv_from_s3(client, bucket_name, file_key):
    try:
        response = client.get_object(Bucket=bucket_name, Key=file_key)
        return pd.read_csv(response['Body'])
    except Exception as e:
        print(f"Error reading CSV from S3: {e}")
        return None

poi_data = read_csv_from_s3(s3_client, aws_bucket_name, poi_location)
 
3 REPLIES

cgrant
Databricks Employee

For use cases where you want to use cloud service credentials to authenticate to cloud services, I recommend using Unity Catalog Service Credentials. These work with both serverless and classic compute in Databricks.

You'd create a service credential, and then refer to it in your code like this:

import boto3

# Get a botocore session backed by the Unity Catalog service credential
credential = dbutils.credentials.getServiceCredentialsProvider('your-service-credential')
boto3_session = boto3.Session(botocore_session=credential, region_name='your-aws-region')

# Any client created from this session authenticates with the service credential
sm = boto3_session.client('secretsmanager')
sm.get_secret_value...
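
For the original use case of reading a CSV from S3, the same session can back an S3 client. A minimal sketch, assuming a service credential named 'your-service-credential' with read access to the bucket; the region, bucket, and key values are placeholders:

import boto3
import pandas as pd

# Credential name, region, bucket, and key below are placeholders
credential = dbutils.credentials.getServiceCredentialsProvider('your-service-credential')
boto3_session = boto3.Session(botocore_session=credential, region_name='your-aws-region')

s3 = boto3_session.client('s3')
response = s3.get_object(Bucket='your-bucket', Key='path/to/file.csv')
poi_data = pd.read_csv(response['Body'])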

Isi
Contributor III

Hi @petitregny ,

The issue you’re encountering is likely due to the access mode of your cluster. Serverless compute uses standard/shared access mode, which does not allow you to directly access AWS credentials (such as the instance profile) in the same way as single-user/dedicated access mode.

That’s why your code works on a personal compute (with dedicated access mode and an instance profile properly attached) but fails on serverless, where the credentials are not directly available in the environment.

You can read more in the Databricks documentation:

“Because serverless compute for workflows uses standard access mode, your workloads must support this access mode.”

If you really need to use boto3 in this context, you have a few options:

  1. Use Databricks Secrets:

    Store your AWS access key and secret in a secret scope and load them in your notebook (see the sketch after this list). This isn’t the cleanest approach, but it avoids complex configuration and works in most cases.

  2. Use Service Credentials with Unity Catalog:

    This is a more robust and secure solution, but it does require some architectural setup, including creating a Service Principal, assigning the correct permissions in Unity Catalog, and configuring cross-account IAM roles in AWS. If you’re not familiar with these concepts, it may feel a bit heavy at first.

  3. Stick with spark.read.csv() if possible:

    Since it works under the hood with Databricks’ credentials delegation and accesses S3 through an External Location, it’s the most compatible and secure way to read data from S3 in serverless environments.
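
A minimal sketch of option 1, assuming a secret scope named "aws" with keys "access-key-id" and "secret-access-key" (scope name, key names, and region are placeholders), reusing the functions from the original post:

# Scope name, key names, and region below are placeholders
key_id = dbutils.secrets.get(scope="aws", key="access-key-id")
access_key = dbutils.secrets.get(scope="aws", key="secret-access-key")

s3_client = create_s3_client(key_id, access_key, "your-aws-region")
poi_data = read_csv_from_s3(s3_client, aws_bucket_name, poi_location)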

Hope this helps 🙂

Isi

petitregny
New Contributor II

Thank you Isi, I will try your suggestions.
