CREATE EXTERNAL LOCATION on a publicly available S3 bucket
10-03-2024 10:16 AM
I would like to create an external location on a publicly available S3 bucket, for which I don't have credentials. I get a syntax error unless I include credentials. Is there a way to do this?
Labels: Unity Catalog
10-03-2024 12:56 PM
Based on the documentation below, you will not be able to do so:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-external-locations
A storage credential has a one-to-many relationship with external locations. In other words, an external location must have a storage credential.
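For reference, the documented syntax makes the credential clause mandatory, which is why omitting it produces a syntax error. A minimal sketch (the location, bucket, and credential names are placeholders):
# Sketch only: location, URL, and credential names are placeholders.
# Per the linked docs, WITH (STORAGE CREDENTIAL ...) is a required clause.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS my_location
    URL 's3://my-bucket/some/path'
    WITH (STORAGE CREDENTIAL my_credential)
""")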
Also, this article on creating storage credentials mentions extra requirements: for example, the S3 bucket must be in the same region as the workspaces you want to access the data from, and the bucket name cannot contain dots:
https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html
It also makes sense not to allow public S3 buckets: you need to effectively own the cloud storage location so that you can grant privileges on it as part of Unity Catalog permission management. If the bucket is public, you have no such control over it.
10-04-2024 03:26 AM
I am trying to come up with a low-code/low-cost data ingestion pattern for a publicly available dataset on S3 (https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/), where there are hundreds of thousands of files in a folder, with more added daily.
There is a 'file list' CSV file (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than listing them via boto3 (which works but returns at most 1,000 results per request and might be costly).
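For what it's worth, the 1,000-result cap is per request, and boto3's paginator can work around it. A minimal sketch using anonymous (unsigned) requests, assuming the same public bucket used later in this thread (the prefix is an assumption for illustration):
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client for a public bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Each ListObjectsV2 call returns at most 1,000 keys; the paginator
# follows continuation tokens to fetch them all.
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="pmc-oa-opendata",        # public PMC Open Access bucket
    Prefix="author_manuscript/txt/", # prefix is an assumption for illustration
)

keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
print(f"Found {len(keys)} objects")
With hundreds of thousands of objects this still means hundreds of LIST requests, so the file-list CSV is indeed the cheaper way to enumerate the documents.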
10-05-2024 11:59 AM - edited 10-05-2024 12:01 PM
Hi @jerickson ,
I have tested this on Databricks Runtime 14.3 LTS:
- Install the following Maven packages on the cluster:
- com.amazonaws:aws-java-sdk-bundle:1.12.262
- org.apache.hadoop:hadoop-aws:3.3.4
- Run the code below to read your CSV file into a DataFrame:
# Use anonymous credentials to read from the public bucket
spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
# Public PMC Open Access bucket and the path to the file-list CSV
bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'
csv_s3_uri = f's3a://{bucket_name}/{csv_key}'
# Read the CSV into a DataFrame, treating the first row as the header
df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)
- Display the first 5 records:
# Display the first 5 records
df.show(n=5, truncate=False)
- Run df.count() to show the file count:
df.count()