CREATE EXTERNAL LOCATION on a publicly available S3 bucket
10-03-2024 10:16 AM
I would like to create an external location on a publicly available S3 bucket, for which I don't have credentials. I get a syntax error unless I include credentials. Is there a way to do this?
Labels: Unity Catalog
10-03-2024 12:56 PM
Based on the documentation below, you will not be able to do so:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-external-locations
A storage credential has a one-to-many relationship with external locations. In other words, an external location must have a storage credential.
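For reference, the documented syntax makes the credential clause mandatory, which is why omitting it produces a syntax error. A minimal sketch (the location, bucket, and credential names are placeholders):
# Sketch only: location, URL, and credential names are placeholders.
# Per the linked docs, WITH (STORAGE CREDENTIAL ...) is a required clause.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS my_location
    URL 's3://my-bucket/some/path'
    WITH (STORAGE CREDENTIAL my_credential)
""")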
Also, this article on creating storage credentials mentions extra requirements: for example, the S3 bucket must be in the same region as the workspaces you want to access the data from, and the bucket name cannot contain dots:
https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html
It also makes sense not to allow public S3 buckets: you need to effectively own the cloud storage location so that you can grant privileges on it as part of Unity Catalog permission management. If the bucket is public, you have no such control over it.
10-04-2024 03:26 AM
I am trying to come up with a low-code/low-cost data ingestion pattern for a publicly available dataset on S3 (https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/), where there are hundreds of thousands of files in a folder, with more added daily.
There is a 'file list' CSV file (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than listing them via boto3 (which works but returns at most 1,000 results per request and might be costly).
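For what it's worth, the 1,000-result cap is per request, and boto3's paginator can work around it. A minimal sketch using anonymous (unsigned) requests, assuming the same public bucket used later in this thread (the prefix is an assumption for illustration):
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client for a public bucket
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Each ListObjectsV2 call returns at most 1,000 keys; the paginator
# follows continuation tokens to fetch them all.
paginator = s3.get_paginator("list_objects_v2")
pages = paginator.paginate(
    Bucket="pmc-oa-opendata",        # public PMC Open Access bucket
    Prefix="author_manuscript/txt/", # prefix is an assumption for illustration
)

keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
print(f"Found {len(keys)} objects")
With hundreds of thousands of objects this still means hundreds of LIST requests, so the file-list CSV is indeed the cheaper way to enumerate the documents.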
10-05-2024 11:59 AM - edited 10-05-2024 12:01 PM
Hi @jerickson ,
I have tested this on Databricks Runtime 14.3 LTS:
- Install the following Maven packages on the cluster:
- com.amazonaws:aws-java-sdk-bundle:1.12.262
- org.apache.hadoop:hadoop-aws:3.3.4
- Run the code below to read your CSV file into a DataFrame:
# Use anonymous credentials to read from the public bucket
spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
# Public PMC Open Access bucket and the path to the file-list CSV
bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'
csv_s3_uri = f's3a://{bucket_name}/{csv_key}'
# Read the CSV into a DataFrame, treating the first row as the header
df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)
- Display the first 5 records:
# Display the first 5 records
df.show(n=5, truncate=False)
- Run df.count() to show the file count:
df.count()