<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: S3 access credentials: Pandas vs Spark in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103574#M2624</link>
    <description>&lt;P&gt;Yes, Pandas uses a different access mechanism compared to Spark when accessing S3. While Spark can leverage the "external location" configuration in Databricks to access S3 without explicitly specifying credentials, Pandas requires explicit AWS credentials to access S3.&lt;BR /&gt;&lt;BR /&gt;You can configure credentials as follows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Instance Profiles&lt;/STRONG&gt;: Attach an instance profile to your Databricks cluster that has the necessary permissions to access the S3 bucket. This way, the credentials are managed by AWS and are available to all libraries, including Pandas.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Databricks Secrets&lt;/STRONG&gt;: Store your AWS credentials in Databricks Secrets and configure your notebooks to use these secrets. This approach keeps your credentials secure and avoids hardcoding them in your notebooks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 30 Dec 2024 15:25:34 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2024-12-30T15:25:34Z</dc:date>
    <item>
      <title>S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103562#M2621</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I need to read Parquet files located in S3 into a Pandas DataFrame.&lt;/P&gt;&lt;P&gt;I configured an "external location" to access my S3 bucket and have&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;df = spark.read.parquet(s3_parquet_file_path)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;working perfectly well.&lt;BR /&gt;&lt;BR /&gt;However,&lt;/DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;df = pd.read_parquet(s3_parquet_file_path)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;fails with a NoCredentialsError (it also requires fsspec and s3fs).&lt;BR /&gt;&lt;BR /&gt;What am I missing? Do I need to provision "credentials" in addition to the "external location"?&lt;BR /&gt;&lt;BR /&gt;Regards&lt;/DIV&gt;&lt;DIV&gt;Stas&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 30 Dec 2024 14:56:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103562#M2621</guid>
      <dc:creator>staskh</dc:creator>
      <dc:date>2024-12-30T14:56:41Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103564#M2622</link>
      <description>&lt;P&gt;May I know the exact error message being received?&lt;BR /&gt;&lt;BR /&gt;Can you confirm you have the following set:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;To read Parquet files from S3 into a Pandas DataFrame, you need to ensure that the necessary libraries (&lt;CODE&gt;fsspec&lt;/CODE&gt; and &lt;CODE&gt;s3fs&lt;/CODE&gt;) are installed and that the appropriate credentials are provided. Here are the steps you can follow:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Install the required libraries&lt;/STRONG&gt;:&lt;/P&gt;
&lt;DIV class="gb5fhw2"&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python _1t7bu9hb hljs language-python gb5fhw3"&gt;%pip install fsspec s3fs&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Provide AWS credentials&lt;/STRONG&gt;: You need to ensure that your AWS credentials are accessible to &lt;CODE&gt;s3fs&lt;/CODE&gt;.&lt;/P&gt;
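For step 2, pandas can take the credentials directly through its storage_options argument, which it forwards to s3fs. A minimal sketch; the bucket path, key names, and placeholder credentials are assumptions, not values from this thread:

```python
# Sketch only: placeholder path and credentials -- substitute your own.

def s3_storage_options(access_key, secret_key, session_token=None):
    """Build the storage_options dict that pandas forwards to s3fs."""
    opts = {"key": access_key, "secret": secret_key}
    if session_token is not None:
        opts["token"] = session_token  # needed for temporary STS credentials
    return opts

# With fsspec and s3fs installed, the read then looks like:
# import pandas as pd
# df = pd.read_parquet(
#     "s3://my-bucket/data/file.parquet",  # placeholder path
#     storage_options=s3_storage_options("AKIA...", "your-secret-key"),
# )
```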
&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Mon, 30 Dec 2024 14:59:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103564#M2622</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-12-30T14:59:59Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103569#M2623</link>
      <description>&lt;P&gt;Thank you for the prompt response!&lt;BR /&gt;&lt;BR /&gt;I did install fsspec and s3fs. The error I see is specific to credentials:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="staskh_0-1735571196930.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13778iF82EEB54F51C23DE/image-size/medium?v=v2&amp;amp;px=400" role="button" title="staskh_0-1735571196930.png" alt="staskh_0-1735571196930.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I'm just confused because I did provision the S3 bucket as an "external location", and Spark read the Parquet file without any additional credentials. Does Pandas use a different access mechanism? Can I use Pandas WITHOUT explicitly specifying AWS credentials? Can credentials be configured at the workspace level without needing to include them in each notebook?&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Stas&lt;/P&gt;</description>
      <pubDate>Mon, 30 Dec 2024 15:10:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103569#M2623</guid>
      <dc:creator>staskh</dc:creator>
      <dc:date>2024-12-30T15:10:24Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103574#M2624</link>
      <description>&lt;P&gt;Yes, Pandas uses a different access mechanism compared to Spark when accessing S3. While Spark can leverage the "external location" configuration in Databricks to access S3 without explicitly specifying credentials, Pandas requires explicit AWS credentials to access S3.&lt;BR /&gt;&lt;BR /&gt;You can configure credentials as follows:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Instance Profiles&lt;/STRONG&gt;: Attach an instance profile to your Databricks cluster that has the necessary permissions to access the S3 bucket. This way, the credentials are managed by AWS and are available to all libraries, including Pandas.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Databricks Secrets&lt;/STRONG&gt;: Store your AWS credentials in Databricks Secrets and configure your notebooks to use these secrets. This approach keeps your credentials secure and avoids hardcoding them in your notebooks.&lt;/LI&gt;
&lt;/UL&gt;
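The Databricks Secrets option above can be sketched as follows. The secret scope name "aws" and the key names are assumptions, and dbutils is available only inside a Databricks notebook:

```python
# Sketch only: scope and key names ("aws", "access_key", "secret_key") are
# assumptions; create your own with the Databricks Secrets CLI or API.

def credentials_from_secrets(dbutils, scope="aws"):
    """Fetch AWS credentials from Databricks Secrets as a storage_options dict."""
    return {
        "key": dbutils.secrets.get(scope=scope, key="access_key"),
        "secret": dbutils.secrets.get(scope=scope, key="secret_key"),
    }

# In a notebook (requires fsspec and s3fs):
# import pandas as pd
# df = pd.read_parquet("s3://my-bucket/data/file.parquet",
#                      storage_options=credentials_from_secrets(dbutils))
```

This keeps the keys out of notebook source while still satisfying s3fs's need for explicit credentials.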
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Dec 2024 15:25:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103574#M2624</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-12-30T15:25:34Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103647#M2627</link>
      <description>&lt;P&gt;Thank you again for such a valuable response!&lt;BR /&gt;&lt;BR /&gt;When recommending instance profiles, did you mean the solution described at&amp;nbsp;&lt;A href="https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html" target="_blank"&gt;https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html&lt;/A&gt;? It is noted as a "legacy pattern", with Unity Catalog recommended instead.&lt;BR /&gt;&lt;BR /&gt;Do I understand correctly that the Spark library uses the Unity Catalog credential model (which is why the "external location" provisioning works well), but the Pandas library still follows the legacy credential model and needs different permission provisioning?&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;Stas&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 08:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103647#M2627</guid>
      <dc:creator>staskh</dc:creator>
      <dc:date>2024-12-31T08:26:54Z</dc:date>
    </item>
    <item>
      <title>Re: S3 access credentials: Pandas vs Spark</title>
      <link>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103712#M2632</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Yes, you understand correctly. The Spark library in Databricks uses the Unity Catalog credential model, which includes the use of "external locations" for managing data access. This model ensures that access control and permissions are centrally managed and enforced through Unity Catalog.&lt;/P&gt;
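Given that split, one way to stay entirely on the Unity Catalog path is to let Spark perform the S3 read and then convert the result, so no explicit AWS credentials are needed. A sketch; the path is a placeholder:

```python
# Sketch only: the S3 path is a placeholder; the `spark` SparkSession is
# predefined in any Databricks notebook.

def read_parquet_via_spark(spark, path):
    """Read Parquet with Spark (the Unity Catalog external location supplies
    credentials), then convert the result to a pandas DataFrame."""
    return spark.read.parquet(path).toPandas()

# pdf = read_parquet_via_spark(spark, "s3://my-bucket/data/file.parquet")
```

Note that toPandas() collects the full dataset onto the driver, so this only suits data that fits in driver memory.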
&lt;P class="_1t7bu9h1 paragraph"&gt;On the other hand, the Pandas library still follows the legacy credential model. This means that it requires different permission provisioning compared to the Unity Catalog model used by Spark.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 14:59:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/s3-access-credentials-pandas-vs-spark/m-p/103712#M2632</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2024-12-31T14:59:00Z</dc:date>
    </item>
  </channel>
</rss>

