Data Governance
Join discussions on data governance practices, compliance, and security within the Databricks Community. Exchange strategies and insights to ensure data integrity and regulatory compliance.

CREATE EXTERNAL LOCATION on a publicly available S3 bucket

jerickson
New Contributor II

I would like to create an external location on a publicly available S3 bucket, for which I don't have credentials. I get a syntax error unless I include credentials. Is there a way to do this?

 
CREATE EXTERNAL LOCATION public_bucket
URL 's3://public_bucket'
WITH (CREDENTIAL ?)
3 REPLIES

filipniziol
Contributor III

Based on the documentation below, this is not possible:
https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-external-locations

A storage credential has a one-to-many relationship with external locations.
In other words, every external location must have a storage credential.

filipniziol_0-1727983806008.png

The article on creating storage credentials also lists extra requirements: for example, the S3 bucket must be in the same region as the workspaces you want to access the data from, and the bucket name cannot contain dots:

https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html

It also makes sense that public S3 buckets are not allowed: you need to effectively own the cloud storage location so that you can grant privileges on it as part of Unity Catalog (UC) permission management. If the bucket is public, you have no control over it.

jerickson
New Contributor II

I am trying to come up with a low-code, low-cost data ingestion pattern for a publicly available dataset on S3 (https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/), where there are hundreds of thousands of files in a folder and more are added daily.

There is a 'file list' CSV file (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than trying to do it via boto3 (which works but returns at most 1,000 results per call and might be costly):

 

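import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client for the public bucket
s3_client = boto3.client(
    's3',
    region_name='us-east-1',  # Specify the region
    config=Config(signature_version=UNSIGNED)
)
# bucket_name and folder_prefix must be set before this call
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix)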
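A single `list_objects_v2` call returns at most 1,000 keys, but boto3 ships a paginator that follows the continuation tokens automatically. The following is a minimal sketch of that approach, not something taken from the thread: the helper names `make_anonymous_s3_client` and `iter_object_keys` are hypothetical, and the boto3 imports sit inside the first function only so that the pagination helper can be exercised without boto3 installed.

```python
from typing import Iterator


def make_anonymous_s3_client(region_name: str = "us-east-1"):
    """Build an S3 client that sends unsigned (anonymous) requests."""
    # Imported lazily so iter_object_keys stays importable without boto3.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client(
        "s3",
        region_name=region_name,
        config=Config(signature_version=UNSIGNED),
    )


def iter_object_keys(s3_client, bucket_name: str, folder_prefix: str) -> Iterator[str]:
    """Yield every object key under a prefix, one page of results at a time."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=folder_prefix):
        # "Contents" is absent on a page that has no matching objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Hypothetical usage (needs boto3 and network access):
# client = make_anonymous_s3_client()
# for key in iter_object_keys(client, bucket_name, folder_prefix):
#     print(key)
```

Each page is still one LIST request, so a folder with hundreds of thousands of files needs hundreds of requests per full listing. Reading the published file list is likely the lighter option for a daily job.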
 
I was hoping to create an external table on the file list csv (which is 600 MB) as opposed to trying to download it daily, but I think that might be optimistic.
 
Any other ideas are appreciated...
Thanks.

filipniziol
Contributor III

Hi @jerickson ,
I have tested this on Databricks Runtime 14.3 LTS:

  • Install the following Maven packages on the cluster:
    1. com.amazonaws:aws-java-sdk-bundle:1.12.262
    2. org.apache.hadoop:hadoop-aws:3.3.4
  • Run the code below to read your CSV file into a DataFrame:

 

spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'

csv_s3_uri = f's3a://{bucket_name}/{csv_key}'

df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)

 

  • Display the first 5 records:

 

# Display the first 5 records
df.show(n=5, truncate=False)

 

filipniziol_1-1728154365164.png

  • Run df.count() to show the file count:

 filipniziol_0-1728154322135.png
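The `spark.conf.set` call in the steps above switches the credentials provider for every `s3a://` path in the session, which could interfere with reads from private buckets in the same notebook. Hadoop's S3A connector documents per-bucket options of the form `fs.s3a.bucket.<bucket>.<option>`, so one possible refinement is to scope anonymous access to the public bucket only. This is a sketch under that assumption: it was not part of the tested steps, the helper names `anonymous_bucket_conf` and `apply_conf` are hypothetical, and whether a Databricks cluster honours the per-bucket key when set at session level would need to be verified.

```python
ANONYMOUS_PROVIDER = "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider"


def anonymous_bucket_conf(bucket_name: str) -> dict:
    """Return S3A options that enable anonymous access for one bucket only."""
    return {
        # Per-bucket override: "fs.s3a." becomes "fs.s3a.bucket.<bucket>."
        f"fs.s3a.bucket.{bucket_name}.aws.credentials.provider": ANONYMOUS_PROVIDER,
    }


def apply_conf(spark, options: dict) -> None:
    """Apply each option to the given Spark session's runtime configuration."""
    for key, value in options.items():
        spark.conf.set(key, value)


# Hypothetical usage in a notebook where `spark` already exists:
# apply_conf(spark, anonymous_bucket_conf("pmc-oa-opendata"))
# df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)
```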

 

 
