<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic CREATE EXTERNAL LOCATION on a publicly available S3 bucket in Data Governance</title>
    <link>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92679#M2169</link>
    <description>&lt;P&gt;I would like to create an external location on a publicly available S3 bucket, for which I don't have credentials. I get a syntax error unless I include credentials. Is there a way to do this?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;CREATE EXTERNAL LOCATION public_bucket
URL 's3://public_bucket'
WITH (CREDENTIAL ?)&lt;/LI-CODE&gt;</description>
    <pubDate>Thu, 03 Oct 2024 17:16:04 GMT</pubDate>
    <dc:creator>jerickson</dc:creator>
    <dc:date>2024-10-03T17:16:04Z</dc:date>
    <item>
      <title>CREATE EXTERNAL LOCATION on a publicly available S3 bucket</title>
      <link>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92679#M2169</link>
      <description>&lt;P&gt;I would like to create an external location on a publicly available S3 bucket, for which I don't have credentials. I get a syntax error unless I include credentials. Is there a way to do this?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;CREATE EXTERNAL LOCATION public_bucket
URL 's3://public_bucket'
WITH (CREDENTIAL ?)&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 03 Oct 2024 17:16:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92679#M2169</guid>
      <dc:creator>jerickson</dc:creator>
      <dc:date>2024-10-03T17:16:04Z</dc:date>
    </item>
    <item>
      <title>Re: CREATE EXTERNAL LOCATION on a publicly available S3 bucket</title>
      <link>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92695#M2170</link>
      <description>&lt;P&gt;Based on the documentation below, you will not be able to do so:&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-external-locations" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-external-locations&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;A storage credential has a one-to-many relationship with external locations.&lt;BR /&gt;In other words, every external location must reference a storage credential.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1727983806008.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11678i0082AE6DBA2EEB84/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1727983806008.png" alt="filipniziol_0-1727983806008.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Also, this article on creating STORAGE CREDENTIALS mentions extra requirements: for example, the S3 bucket must be in the same region as the workspaces that will access the data, and the bucket name cannot contain dots:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html" target="_blank"&gt;https://docs.databricks.com/en/connect/unity-catalog/storage-credentials.html&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;It also makes sense not to allow public S3 buckets: you effectively need to own the cloud storage location so that you can grant privileges on it as part of Unity Catalog permission management. If the bucket is public, you have no control over it.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Oct 2024 19:56:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92695#M2170</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-03T19:56:02Z</dc:date>
    </item>
    <item>
      <title>Re: CREATE EXTERNAL LOCATION on a publicly available S3 bucket</title>
      <link>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92752#M2171</link>
      <description>&lt;P&gt;I am trying to come up with a low-code/low-cost data ingestion pattern for a publicly available dataset on S3 (&lt;A href="https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/" target="_blank"&gt;https://www.ncbi.nlm.nih.gov/pmc/tools/pmcaws/&lt;/A&gt;), where there are hundreds of thousands of files in a folder, with more added daily.&lt;/P&gt;&lt;P&gt;There is a 'file list' CSV file (possibly an S3 inventory?) that would probably be a better way to identify all of the documents than doing it via boto3 (which works, but is limited to 1,000 results per call and might be costly):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3_client = boto3.client(
    's3',
    region_name='us-east-1',  # Specify the region
    config=Config(signature_version=UNSIGNED)  # Anonymous (unsigned) requests
)
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=folder_prefix)&lt;/LI-CODE&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;I was hoping to create an external table on the file list CSV (which is 600 MB) as opposed to trying to download it daily, but I think that might be optimistic.&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Any other ideas are appreciated...&lt;/DIV&gt;&lt;DIV&gt;Thanks.&lt;/DIV&gt;</description>
      <pubDate>Fri, 04 Oct 2024 10:26:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92752#M2171</guid>
      <dc:creator>jerickson</dc:creator>
      <dc:date>2024-10-04T10:26:00Z</dc:date>
    </item>
    <item>
      <title>Re: CREATE EXTERNAL LOCATION on a publicly available S3 bucket</title>
      <link>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92837#M2172</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/87200"&gt;@jerickson&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;I have tested this on Databricks Runtime 14.3 LTS:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Install the following Maven packages on the cluster:&lt;OL&gt;&lt;LI&gt;com.amazonaws:aws-java-sdk-bundle:1.12.262&lt;/LI&gt;&lt;LI&gt;org.apache.hadoop:hadoop-aws:3.3.4&lt;/LI&gt;&lt;/OL&gt;&lt;/LI&gt;&lt;LI&gt;Run the code below to read your CSV file into a DataFrame:&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.conf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

bucket_name = 'pmc-oa-opendata'
csv_key = 'author_manuscript/txt/metadata/csv/author_manuscript.filelist.csv'

csv_s3_uri = f's3a://{bucket_name}/{csv_key}'

df = spark.read.csv(csv_s3_uri, header=True, inferSchema=True)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Display the first 5 records:&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Display the first 5 records
df.show(n=5, truncate=False)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_1-1728154365164.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11705i3A95EA60BB866999/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_1-1728154365164.png" alt="filipniziol_1-1728154365164.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Run df.count() to show the file count:&lt;/LI&gt;&lt;/UL&gt;&lt;P class="lia-indent-padding-left-30px"&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1728154322135.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11704iBD2C9E2A9026CE5C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1728154322135.png" alt="filipniziol_0-1728154322135.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 05 Oct 2024 19:01:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/create-external-location-on-a-publicly-available-s3-bucket/m-p/92837#M2172</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-05T19:01:57Z</dc:date>
    </item>
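    <!-- For the "more added daily" part of the original question, one low-code pattern is to diff each day's file list against the keys already ingested. A minimal sketch under stated assumptions: the function name and the idea of persisting a processed-key record are illustrative, not something the thread prescribes.

```python
def new_keys(todays_keys, already_processed):
    """Keys present in today's file list that have not been ingested yet.

    todays_keys would come from the file-list CSV read above;
    already_processed from wherever previous runs record their work.
    Set difference drops duplicates; sorting keeps the output stable.
    """
    return sorted(set(todays_keys) - set(already_processed))
```

For example, new_keys(["a.txt", "b.txt", "c.txt"], ["a.txt"]) returns ["b.txt", "c.txt"], i.e. only the files added since the last run need to be fetched. -->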
  </channel>
</rss>