<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: S3 access credentials: Pandas vs Spark in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103574#M2624</link>
    <description>&lt;P&gt;Yes, Pandas uses a different access mechanism compared to Spark when accessing S3. While Spark can leverage the "external location" configuration in Databricks to access S3 without explicitly specifying credentials, Pandas requires explicit AWS credentials to access S3.&lt;BR /&gt;&lt;BR /&gt;You can configure credentials as follows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Instance Profiles&lt;/STRONG&gt;: Attach an instance profile to your Databricks cluster that has the necessary permissions to access the S3 bucket. This way, the credentials are managed by AWS and are available to all libraries, including Pandas.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Databricks Secrets&lt;/STRONG&gt;: Store your AWS credentials in Databricks Secrets and configure your notebooks to use these secrets. This approach keeps your credentials secure and avoids hardcoding them in your notebooks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 30 Dec 2024 15:25:34 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2024-12-30T15:25:34Z</dc:date>
    <item>
      <title>S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103562#M2621</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I need to read Parquet files located in S3 into a Pandas DataFrame.&lt;/P&gt;&lt;P&gt;I configured an "external location" to access my S3 bucket and have&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;df = spark.read.parquet(s3_parquet_file_path)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;working perfectly well.&lt;BR /&gt;&lt;BR /&gt;However,&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;df = pd.read_parquet(s3_parquet_file_path)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;fails with a NoCredentialsError (it also requires fsspec and s3fs).&lt;BR /&gt;&lt;BR /&gt;What am I missing? Do I need to provision "credentials" in addition to the "external location"?&lt;BR /&gt;&lt;BR /&gt;Regards&lt;/DIV&gt;&lt;DIV&gt;Stas&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 30 Dec 2024 14:56:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103562#M2621</guid>
      <dc:creator>staskh</dc:creator>
      <dc:date>2024-12-30T14:56:41Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103564#M2622</link>
      <description>&lt;P&gt;May I know the exact error message being received?&lt;BR /&gt;&lt;BR /&gt;Can you confirm you have the following set:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;To read Parquet files from S3 into a Pandas DataFrame, you need to ensure that the necessary libraries (&lt;CODE&gt;fsspec&lt;/CODE&gt; and &lt;CODE&gt;s3fs&lt;/CODE&gt;) are installed and that the appropriate credentials are provided. Here are the steps you can follow:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Install the required libraries&lt;/STRONG&gt;:&lt;/P&gt;
&lt;DIV class="gb5fhw2"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python _1t7bu9hb hljs language-python gb5fhw3"&gt;%pip install fsspec s3fs&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Provide AWS credentials&lt;/STRONG&gt;: You need to ensure that your AWS credentials are accessible to &lt;CODE&gt;s3fs&lt;/CODE&gt;.&lt;/P&gt;
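For step 2, pandas can take the credentials directly through its storage_options argument, which it forwards to s3fs. A minimal sketch; the bucket path, key names, and placeholder credentials are assumptions, not values from this thread:

```python
# Sketch only: placeholder path and credentials -- substitute your own.

def s3_storage_options(access_key, secret_key, session_token=None):
    """Build the storage_options dict that pandas forwards to s3fs."""
    opts = {"key": access_key, "secret": secret_key}
    if session_token is not None:
        opts["token"] = session_token  # needed for temporary STS credentials
    return opts

# With fsspec and s3fs installed, the read then looks like:
# import pandas as pd
# df = pd.read_parquet(
#     "s3://my-bucket/data/file.parquet",  # placeholder path
#     storage_options=s3_storage_options("AKIA...", "your-secret-key"),
# )
```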
&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Mon, 30 Dec 2024 14:59:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103564#M2622</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-12-30T14:59:59Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103569#M2623</link>
      <description>&lt;P&gt;Thank you for the prompt response!&lt;BR /&gt;&lt;BR /&gt;I did install fsspec and s3fs. The error I see is specific to credentials:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="staskh_0-1735571196930.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13778iF82EEB54F51C23DE/image-size/medium?v=v2&amp;amp;px=400" role="button" title="staskh_0-1735571196930.png" alt="staskh_0-1735571196930.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I'm just confused because I did provision the S3 bucket as an "external location", and Spark read the Parquet file without any additional credentials. Does Pandas use a different access mechanism? Can I use Pandas WITHOUT explicitly specifying AWS credentials? Can credentials be configured at the workspace level without needing to include them in each notebook?&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Stas&lt;/P&gt;</description>
      <pubDate>Mon, 30 Dec 2024 15:10:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103569#M2623</guid>
      <dc:creator>staskh</dc:creator>
      <dc:date>2024-12-30T15:10:24Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103574#M2624</link>
      <description>&lt;P&gt;Yes, Pandas uses a different access mechanism compared to Spark when accessing S3. While Spark can leverage the "external location" configuration in Databricks to access S3 without explicitly specifying credentials, Pandas requires explicit AWS credentials to access S3.&lt;BR /&gt;&lt;BR /&gt;You can configure credentials as follows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Instance Profiles&lt;/STRONG&gt;: Attach an instance profile to your Databricks cluster that has the necessary permissions to access the S3 bucket. This way, the credentials are managed by AWS and are available to all libraries, including Pandas.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Databricks Secrets&lt;/STRONG&gt;: Store your AWS credentials in Databricks Secrets and configure your notebooks to use these secrets. This approach keeps your credentials secure and avoids hardcoding them in your notebooks.&lt;/LI&gt;
&lt;/UL&gt;
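The Databricks Secrets option above can be sketched as follows. The secret scope name "aws" and the key names are assumptions, and dbutils is available only inside a Databricks notebook:

```python
# Sketch only: scope and key names ("aws", "access_key", "secret_key") are
# assumptions; create your own with the Databricks Secrets CLI or API.

def credentials_from_secrets(dbutils, scope="aws"):
    """Fetch AWS credentials from Databricks Secrets as a storage_options dict."""
    return {
        "key": dbutils.secrets.get(scope=scope, key="access_key"),
        "secret": dbutils.secrets.get(scope=scope, key="secret_key"),
    }

# In a notebook (requires fsspec and s3fs):
# import pandas as pd
# df = pd.read_parquet("s3://my-bucket/data/file.parquet",
#                      storage_options=credentials_from_secrets(dbutils))
```

This keeps the keys out of notebook source while still satisfying s3fs's need for explicit credentials.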
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Dec 2024 15:25:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103574#M2624</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-12-30T15:25:34Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103647#M2627</link>
      <description>&lt;P&gt;Thank you again for such a valuable response!&lt;BR /&gt;&lt;BR /&gt;When recommending instance profiles, did you mean the solution described at&amp;nbsp;&lt;A href="https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html" target="_blank"&gt;https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html&lt;/A&gt;? It is noted as a "legacy pattern", with Unity Catalog recommended instead.&lt;BR /&gt;&lt;BR /&gt;Do I understand correctly that the Spark library uses the Unity Catalog credential model (which is why the "external location" provisioning works well), but the Pandas library still follows the legacy credential model and needs different permission provisioning?&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Stas&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 08:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103647#M2627</guid>
      <dc:creator>staskh</dc:creator>
      <dc:date>2024-12-31T08:26:54Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103712#M2632</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Yes, you understand correctly. The Spark library in Databricks uses the Unity Catalog credential model, which includes the use of "external locations" for managing data access. This model ensures that access control and permissions are centrally managed and enforced through Unity Catalog.&lt;/P&gt;
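Given that split, one way to stay entirely on the Unity Catalog path is to let Spark perform the S3 read and then convert the result, so no explicit AWS credentials are needed. A sketch; the path is a placeholder:

```python
# Sketch only: the S3 path is a placeholder; the `spark` SparkSession is
# predefined in any Databricks notebook.

def read_parquet_via_spark(spark, path):
    """Read Parquet with Spark (the Unity Catalog external location supplies
    credentials), then convert the result to a pandas DataFrame."""
    return spark.read.parquet(path).toPandas()

# pdf = read_parquet_via_spark(spark, "s3://my-bucket/data/file.parquet")
```

Note that toPandas() collects the full dataset onto the driver, so this only suits data that fits in driver memory.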
&lt;P class="_1t7bu9h1 paragraph"&gt;On the other hand, the Pandas library still follows the legacy credential model. This means that it requires different permission provisioning compared to the Unity Catalog model used by Spark.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 14:59:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103712#M2632</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-12-31T14:59:00Z</dc:date>
    </item>
  </channel>
</rss>

