Read file from AWS S3 using Azure Databricks

Dnirmania
Contributor

Hi Team,

I am currently working on a project to read CSV files from an AWS S3 bucket using an Azure Databricks notebook. My ultimate goal is to set up Auto Loader in Azure Databricks so that it picks up new files from S3 and loads the data incrementally. However, I am having trouble accessing the S3 bucket from the notebook. Despite creating a new user in AWS and granting it full permissions on the S3 bucket, I am still encountering the following message:

[Screenshot: error message returned when accessing the S3 bucket]

Here is the code I built for the auto loader:

from pyspark.sql import SparkSession

# Initialize Spark session (in a Databricks notebook, `spark` already exists)
spark = SparkSession.builder.appName("S3Access").getOrCreate()

# AWS credentials (placeholders - in practice these should come from a
# Databricks secret scope rather than being hardcoded in the notebook)
access_key = "<AWS_ACCESS_KEY_ID>"
secret_key = "<AWS_SECRET_ACCESS_KEY>"

# Configure the S3A connector to use the AWS credentials
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
# hadoop_conf.set("fs.s3a.session.token", aws_session_token)  # only for temporary credentials
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")

# Full S3 path to the CSV file (not just the bucket)
s3_path = "s3a://taxcom-autoloader/files/file1.csv"

# Read the CSV file from S3
df = spark.read.csv(s3_path)

# Show the data
df.show()
 
If anyone could provide insights on this process, it would be greatly appreciated. Thank you for your help!
 
Thanks,
Dinesh Kumar
4 REPLIES

Brahmareddy
Honored Contributor III

Hi Dnirmania,

How are you doing today? As per my understanding, you're definitely on the right track, and it's great that you're connecting AWS S3 with Azure Databricks; it's a useful setup but can be a bit tricky. From what you shared, the code looks mostly fine, but a few things are worth checking:

- Make sure your S3 bucket allows access from outside AWS; sometimes the bucket policy needs to be updated to allow cross-cloud access.
- Double-check that your fs.s3a.endpoint matches the region where your S3 bucket is located (for example, s3.eu-west-1.amazonaws.com if it's in Ireland).
- For Auto Loader, instead of spark.read.csv(), switch to spark.readStream.format("cloudFiles") and provide .option("cloudFiles.format", "csv") with your S3 path; this enables incremental loading of new files (see the sketch below).
- Store your AWS credentials securely in Databricks secrets instead of hardcoding them.

Let me know if you want help setting up the correct S3 bucket policy or configuring Auto Loader fully!
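
In case it helps, here is a minimal sketch of that Auto Loader pattern; the input path, schema location, checkpoint location, and output path below are placeholders you would adapt to your setup:

# Incremental CSV ingestion with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      # Auto Loader needs a location to track the inferred schema
      .option("cloudFiles.schemaLocation", "dbfs:/autoloader/schema/")
      .load("s3a://taxcom-autoloader/files/"))

# Write incrementally; the checkpoint records which files were already processed
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "dbfs:/autoloader/checkpoint/")
   .trigger(availableNow=True)  # process everything new, then stop
   .start("dbfs:/autoloader/output/"))

On each run, only files that have not yet been processed are picked up, which is exactly the incremental behavior you are after.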

Regards,

Brahma

Dnirmania
Contributor

Thank you, @Brahmareddy , for your response. I updated the code based on your suggestion, but I'm still encountering the same error message. I even made my S3 bucket public, but no luck. Interestingly, I was able to read a CSV file from the S3 bucket using boto3, but I still can't access S3 using Spark.
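
For reference, here is roughly the updated configuration I am running (the secret scope and key names are placeholders):

# Credentials now come from a Databricks secret scope instead of being hardcoded
access_key = dbutils.secrets.get(scope="aws-creds", key="access-key")
secret_key = dbutils.secrets.get(scope="aws-creds", key="secret-key")

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
# Force S3A to use these static keys, in case something else in the
# default credential provider chain is overriding them
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.option("header", "true").csv("s3a://taxcom-autoloader/files/file1.csv")
df.show()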

Nathant93
New Contributor III

I get the same issue trying to access an S3 bucket from Azure Databricks. Keen to be able to read directly rather than going through ADF.

Aviral-Bhardwaj
Esteemed Contributor III

No, it is very easy. Follow this guide and it will work: https://github.com/aviral-bhardwaj/MyPoCs/blob/main/SparkPOC/ETLProjectsAWS-S3toDatabricks.ipynb

AviralBhardwaj
