
S3 access credentials: Pandas vs Spark

staskh
New Contributor II

Hi,

I need to read Parquet files located in S3 into the Pandas dataframe.

I configured "external location" to access my S3 bucket and have

df = spark.read.parquet(s3_parquet_file_path)
working perfectly well.

However, 
df = pd.read_parquet(s3_parquet_file_path)
fails with a NoCredentialsError (it also requires fsspec and s3fs).

What am I missing? Do I need to provision "credentials" in addition to "external location"? 

Regards
Stas

 


5 REPLIES

Walter_C
Databricks Employee

May I know the exact error message being received?

Can you confirm you have the following set up:

To read Parquet files from S3 into a Pandas DataFrame, you need to ensure that the necessary libraries (fsspec and s3fs) are installed and that the appropriate credentials are provided. Here are the steps you can follow:

 

  1. Install the required libraries:

    %pip install fsspec s3fs

  2. Provide AWS credentials: ensure that your AWS credentials are accessible to s3fs, for example by passing them through storage_options as in the sketch below.
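
A minimal sketch of step 2, assuming the credentials are passed directly through storage_options (the bucket path and placeholder keys are illustrative, not from this thread):

    import pandas as pd

    # pandas delegates S3 access to s3fs, which accepts credentials via storage_options.
    df = pd.read_parquet(
        "s3://my-bucket/path/to/data.parquet",  # replace with your S3 path
        storage_options={
            "key": "<AWS_ACCESS_KEY_ID>",         # replace with your access key
            "secret": "<AWS_SECRET_ACCESS_KEY>",  # replace with your secret key
            # "token": "<AWS_SESSION_TOKEN>",     # only needed for temporary credentials
        },
    )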

staskh
New Contributor II

Thank you for the prompt response!

I did install fsspec and s3fs. The error I see is specific to credentials:

[screenshot of the NoCredentialsError traceback]

I am just confused because I did provision the S3 bucket as an "external location", and Spark reads the Parquet file without any additional credentials. Does Pandas use a different access mechanism? Can I use Pandas WITHOUT explicitly specifying AWS credentials? Can credentials be configured at the workspace level without needing to include them in each notebook?

 

Regards

Stas

 

Walter_C
Databricks Employee

Yes, Pandas uses a different access mechanism compared to Spark when accessing S3. While Spark can leverage the "external location" configuration in Databricks to access S3 without explicitly specifying credentials, Pandas requires explicit AWS credentials to access S3.

You can configure credentials as follows:

 

  • Instance Profiles: Attach an instance profile to your Databricks cluster that has the necessary permissions to access the S3 bucket. This way, the credentials are managed by AWS and are available to all libraries, including Pandas.
  • Databricks Secrets: Store your AWS credentials in Databricks Secrets and configure your notebooks to use these secrets. This approach keeps your credentials secure and avoids hardcoding them in your notebooks.
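
To illustrate the second option, here is a minimal sketch assuming the credentials were stored in a secret scope named "aws" under the keys "access_key" and "secret_key" (these names are hypothetical, not part of the original answer). With the first option (an instance profile attached to the cluster), pd.read_parquet should work without storage_options, since s3fs falls back to the standard AWS credential chain.

    import pandas as pd

    # Retrieve the credentials from Databricks Secrets. The scope/key names are
    # hypothetical -- create them beforehand via the Databricks CLI, API, or UI.
    # dbutils is available as a global in Databricks notebooks.
    aws_access_key = dbutils.secrets.get(scope="aws", key="access_key")
    aws_secret_key = dbutils.secrets.get(scope="aws", key="secret_key")

    # pandas passes storage_options down to s3fs.
    df = pd.read_parquet(
        "s3://my-bucket/path/to/data.parquet",  # illustrative path
        storage_options={"key": aws_access_key, "secret": aws_secret_key},
    )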

 

 

staskh
New Contributor II

Thank you again for such a valuable response!

When recommending an Instance Profile, did you mean the solution described at https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html ? It is noted as a "legacy pattern", with Unity Catalog recommended instead.

Do I understand correctly that the Spark library uses the Unity Catalog credential model (which is why the "external location" provisioning works well), while the Pandas library still follows the legacy credential model and needs different permission provisioning?

Regards

Stas

Walter_C
Databricks Employee

Yes, you understand correctly. The Spark library in Databricks uses the Unity Catalog credential model, which includes the use of "external locations" for managing data access. This model ensures that access control and permissions are centrally managed and enforced through Unity Catalog.

On the other hand, the Pandas library still follows the legacy credential model. This means that it requires different permission provisioning compared to the Unity Catalog model used by Spark. 
