filipniziol
Esteemed Contributor

Hi @jerickson,
I have tested this on Databricks Runtime 14.3 LTS:

  • Install the following Maven packages on the cluster:
    1. com.amazonaws:aws-java-sdk-bundle:1.12.262
    2. org.apache.hadoop:hadoop-aws:3.3.4
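If you prefer not to attach the libraries through the cluster UI, the same coordinates can usually be supplied via the cluster's Spark config instead (a sketch; whether this is picked up depends on the runtime, so the cluster-library install above is the safer route):

```
spark.jars.packages com.amazonaws:aws-java-sdk-bundle:1.12.262,org.apache.hadoop:hadoop-aws:3.3.4
```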
  • Run the below code to read your CSV file into a DataFrame:

# Allow unauthenticated reads from the public bucket
spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

# Public PMC Open Access bucket and the object key of the CSV file
bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'

csv_s3_uri = f's3a://{bucket_name}/{csv_key}'

# Read the CSV with a header row and let Spark infer the column types
df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)
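As a quick sanity check that the URI is well-formed before handing it to Spark, you can split it back into bucket and key with the standard library (a minimal sketch, nothing Databricks-specific):

```python
from urllib.parse import urlparse

csv_s3_uri = 's3a://pmc-oa-opendata/author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'

# urlparse treats the bucket as the netloc and the object key as the path
parsed = urlparse(csv_s3_uri)
bucket = parsed.netloc           # the bucket name
key = parsed.path.lstrip('/')    # the object key, without the leading slash

print(bucket)
print(key)
```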

  • Display the first 5 records:

# Display the first 5 records
df.show(n=5, truncate=False)

(screenshot: output of df.show with the first 5 rows of the DataFrame)

  • Run df.count() to show the row count:

(screenshot: output of df.count())