10-05-2024 11:59 AM - edited 10-05-2024 12:01 PM
Hi @jerickson ,
I have tested this on Databricks Runtime 14.3 LTS:
- Install the Maven packages below on the cluster:
- com.amazonaws:aws-java-sdk-bundle:1.12.262
- org.apache.hadoop:hadoop-aws:3.3.4
- Run the code below to read your CSV file into a DataFrame:
spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'
csv_s3_uri = f's3a://{bucket_name}/{csv_key}'
df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)
- Display the first 5 records:
df.show(n=5, truncate=False)
- Run df.count() to get the row count, i.e. the number of files listed:
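The steps above can be sketched as one consolidated snippet. This is a sketch, not a tested script: it assumes a Databricks cluster with the two Maven packages installed and the `spark` session Databricks provides; the helper name `read_public_csv` is mine, not from the original post.

```python
# Sketch of the steps above. Assumes Databricks Runtime 14.3 LTS with
# com.amazonaws:aws-java-sdk-bundle:1.12.262 and
# org.apache.hadoop:hadoop-aws:3.3.4 installed on the cluster.

bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'
csv_s3_uri = f's3a://{bucket_name}/{csv_key}'

def read_public_csv(spark, uri):
    # Anonymous access: no AWS credentials are attached to the request,
    # which is what lets this read a public bucket without keys.
    spark.conf.set(
        "fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    return spark.read.csv(uri, header=True, inferSchema=True)

# On the cluster (requires the live `spark` session):
# df = read_public_csv(spark, csv_s3_uri)
# df.show(n=5, truncate=False)   # first 5 records
# df.count()                     # number of rows = files listed in the CSV
```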