filipniziol
Esteemed Contributor

Hi @jerickson,
I have tested this on Databricks Runtime 14.3 LTS:

  • Install the following Maven packages on the cluster:
    1. com.amazonaws:aws-java-sdk-bundle:1.12.262
    2. org.apache.hadoop:hadoop-aws:3.3.4
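If you prefer not to attach the libraries through the cluster UI, the same coordinates can usually be supplied via the cluster's Spark config instead (a sketch; whether this is picked up depends on the runtime, so the cluster-library install above is the safer route):

```
spark.jars.packages com.amazonaws:aws-java-sdk-bundle:1.12.262,org.apache.hadoop:hadoop-aws:3.3.4
```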
  • Run the below code to read your CSV file into a DataFrame:

# Allow unauthenticated reads from the public bucket
spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

# Public PMC Open Access bucket and the object key of the CSV file
bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'

csv_s3_uri = f's3a://{bucket_name}/{csv_key}'

# Read the CSV with a header row and let Spark infer the column types
df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)
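As a quick sanity check that the URI is well-formed before handing it to Spark, you can split it back into bucket and key with the standard library (a minimal sketch, nothing Databricks-specific):

```python
from urllib.parse import urlparse

csv_s3_uri = 's3a://pmc-oa-opendata/author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'

# urlparse treats the bucket as the netloc and the object key as the path
parsed = urlparse(csv_s3_uri)
bucket = parsed.netloc           # the bucket name
key = parsed.path.lstrip('/')    # the object key, without the leading slash

print(bucket)
print(key)
```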

  • Display the first 5 records:

# Display the first 5 records
df.show(n=5, truncate=False)

(screenshot: output of df.show with the first 5 rows of the DataFrame)

  • Run df.count() to show the row count:

(screenshot: output of df.count())