
S3 access credentials: Pandas vs Spark

staskh
New Contributor II

Hi,

I need to read Parquet files located in S3 into the Pandas dataframe.

I configured "external location" to access my S3 bucket and have

df = spark.read.parquet(s3_parquet_file_path)
working perfectly well.

However, 
df = pd.read_parquet(s3_parquet_file_path)
fails with a NoCredentialsError (it also requires fsspec and s3fs).

What am I missing? Do I need to provision "credentials" in addition to "external location"? 

Regards
Stas

 


5 REPLIES

Walter_C
Databricks Employee

May I know the exact error message being received?

Can you confirm you have the following set up:

To read Parquet files from S3 into a Pandas DataFrame, you need to ensure that the necessary libraries (fsspec and s3fs) are installed and that the appropriate credentials are provided. Here are the steps you can follow:

 

  1. Install the required libraries:

    %pip install fsspec s3fs

  2. Provide AWS credentials: ensure that your AWS credentials are accessible to s3fs, for example by passing them through storage_options as in the sketch below.
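
A minimal sketch of step 2, assuming the credentials are passed directly through storage_options (the bucket path and placeholder keys are illustrative, not from this thread):

    import pandas as pd

    # pandas delegates S3 access to s3fs, which accepts credentials via storage_options.
    df = pd.read_parquet(
        "s3://my-bucket/path/to/data.parquet",  # replace with your S3 path
        storage_options={
            "key": "<AWS_ACCESS_KEY_ID>",         # replace with your access key
            "secret": "<AWS_SECRET_ACCESS_KEY>",  # replace with your secret key
            # "token": "<AWS_SESSION_TOKEN>",     # only needed for temporary credentials
        },
    )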

staskh
New Contributor II

Thank you for the prompt response!

I did install fsspec and s3fs. The error I see is specific to credentials:

[screenshot of the NoCredentialsError traceback]

I am just confused because I did provision the S3 bucket as an "external location", and Spark reads the Parquet file without any additional credentials. Does Pandas use a different access mechanism? Can I use Pandas WITHOUT explicitly specifying AWS credentials? Can credentials be configured at the workspace level without needing to include them in each notebook?

 

Regards

Stas

 

Walter_C
Databricks Employee

Yes, Pandas uses a different access mechanism compared to Spark when accessing S3. While Spark can leverage the "external location" configuration in Databricks to access S3 without explicitly specifying credentials, Pandas requires explicit AWS credentials to access S3.

You can configure credentials as follows:

 

  • Instance Profiles: Attach an instance profile to your Databricks cluster that has the necessary permissions to access the S3 bucket. This way, the credentials are managed by AWS and are available to all libraries, including Pandas.
  • Databricks Secrets: Store your AWS credentials in Databricks Secrets and configure your notebooks to use these secrets. This approach keeps your credentials secure and avoids hardcoding them in your notebooks.
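
To illustrate the second option, here is a minimal sketch assuming the credentials were stored in a secret scope named "aws" under the keys "access_key" and "secret_key" (these names are hypothetical, not part of the original answer). With the first option (an instance profile attached to the cluster), pd.read_parquet should work without storage_options, since s3fs falls back to the standard AWS credential chain.

    import pandas as pd

    # Retrieve the credentials from Databricks Secrets. The scope/key names are
    # hypothetical -- create them beforehand via the Databricks CLI, API, or UI.
    # dbutils is available as a global in Databricks notebooks.
    aws_access_key = dbutils.secrets.get(scope="aws", key="access_key")
    aws_secret_key = dbutils.secrets.get(scope="aws", key="secret_key")

    # pandas passes storage_options down to s3fs.
    df = pd.read_parquet(
        "s3://my-bucket/path/to/data.parquet",  # illustrative path
        storage_options={"key": aws_access_key, "secret": aws_secret_key},
    )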

 

 

staskh
New Contributor II

Thank you again for such a valuable response!

When recommending an Instance Profile, did you mean the solution described at https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html ? It is noted as a "legacy pattern", with Unity Catalog recommended instead.

Do I understand correctly that the Spark library uses the Unity Catalog credential model (which is why the "external location" provisioning works well), while the Pandas library still follows the legacy credential model and needs different permission provisioning?

Regards

Stas

Walter_C
Databricks Employee

Yes, you understand correctly. The Spark library in Databricks uses the Unity Catalog credential model, which includes the use of "external locations" for managing data access. This model ensures that access control and permissions are centrally managed and enforced through Unity Catalog.

On the other hand, the Pandas library still follows the legacy credential model. This means that it requires different permission provisioning compared to the Unity Catalog model used by Spark. 
