Read file from AWS S3 using Azure Databricks

Dnirmania
Contributor

Hi Team,

I am currently working on a project to read CSV files from an AWS S3 bucket using an Azure Databricks notebook. My ultimate goal is to set up Auto Loader in Azure Databricks so that it picks up new files from S3 and loads the data incrementally. However, I am having trouble accessing the S3 bucket from the notebook. Despite creating a new user in AWS and granting it full permissions on the S3 bucket, I am still encountering the following message:

[Screenshot: error message returned when accessing the S3 bucket]

Here is the code I built for the auto loader:

from pyspark.sql import SparkSession

# Initialize Spark session (in a Databricks notebook, `spark` already exists)
spark = SparkSession.builder.appName("S3Access").getOrCreate()

# AWS credentials (placeholders - in practice these should come from a
# Databricks secret scope rather than being hardcoded in the notebook)
access_key = "<AWS_ACCESS_KEY_ID>"
secret_key = "<AWS_SECRET_ACCESS_KEY>"

# Configure the S3A connector to use the AWS credentials
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
# hadoop_conf.set("fs.s3a.session.token", aws_session_token)  # only for temporary credentials
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")

# Full S3 path to the CSV file (not just the bucket)
s3_path = "s3a://taxcom-autoloader/files/file1.csv"

# Read the CSV file from S3
df = spark.read.csv(s3_path)

# Show the data
df.show()
 
If anyone could provide insights on this process, it would be greatly appreciated. Thank you for your help!
 
Thanks,
Dinesh Kumar
4 REPLIES

Brahmareddy
Honored Contributor III

Hi Dnirmania,

How are you doing today? As per my understanding, you're definitely on the right track, and it's great that you're connecting AWS S3 with Azure Databricks; it's a useful setup but can be a bit tricky. From what you shared, the code looks mostly fine, but a few things are worth checking:

- Make sure your S3 bucket allows access from outside AWS; sometimes the bucket policy needs to be updated to allow cross-cloud access.
- Double-check that your fs.s3a.endpoint matches the region where your S3 bucket is located (for example, s3.eu-west-1.amazonaws.com if it's in Ireland).
- For Auto Loader, instead of spark.read.csv(), switch to spark.readStream.format("cloudFiles") and provide .option("cloudFiles.format", "csv") with your S3 path; this enables incremental loading of new files (see the sketch below).
- Store your AWS credentials securely in Databricks secrets instead of hardcoding them.

Let me know if you want help setting up the correct S3 bucket policy or configuring Auto Loader fully!
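
In case it helps, here is a minimal sketch of that Auto Loader pattern; the input path, schema location, checkpoint location, and output path below are placeholders you would adapt to your setup:

# Incremental CSV ingestion with Auto Loader
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      # Auto Loader needs a location to track the inferred schema
      .option("cloudFiles.schemaLocation", "dbfs:/autoloader/schema/")
      .load("s3a://taxcom-autoloader/files/"))

# Write incrementally; the checkpoint records which files were already processed
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "dbfs:/autoloader/checkpoint/")
   .trigger(availableNow=True)  # process everything new, then stop
   .start("dbfs:/autoloader/output/"))

On each run, only files that have not yet been processed are picked up, which is exactly the incremental behavior you are after.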

Regards,

Brahma

Dnirmania
Contributor

Thank you, @Brahmareddy , for your response. I updated the code based on your suggestion, but I'm still encountering the same error message. I even made my S3 bucket public, but no luck. Interestingly, I was able to read a CSV file from the S3 bucket using boto3, but I still can't access S3 using Spark.
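
For reference, here is roughly the updated configuration I am running (the secret scope and key names are placeholders):

# Credentials now come from a Databricks secret scope instead of being hardcoded
access_key = dbutils.secrets.get(scope="aws-creds", key="access-key")
secret_key = dbutils.secrets.get(scope="aws-creds", key="secret-key")

hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", access_key)
hadoop_conf.set("fs.s3a.secret.key", secret_key)
# Force S3A to use these static keys, in case something else in the
# default credential provider chain is overriding them
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.endpoint", "s3.amazonaws.com")

df = spark.read.option("header", "true").csv("s3a://taxcom-autoloader/files/file1.csv")
df.show()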

Nathant93
New Contributor III

I get the same issue trying to access an S3 bucket from Azure Databricks. Keen to be able to read directly rather than going through ADF.

Aviral-Bhardwaj
Esteemed Contributor III

No, it is very easy. Follow this guide and it will work: https://github.com/aviral-bhardwaj/MyPoCs/blob/main/SparkPOC/ETLProjectsAWS-S3toDatabricks.ipynb

AviralBhardwaj
