Write to csv file in S3 bucket

mh_db
New Contributor II

I have a pandas DataFrame in my PySpark notebook that I want to save to my S3 bucket. I'm using the following command to save it:

import boto3
import s3fs

df_summary.to_csv("s3://dataconversion/data/exclude", index=False)

but I keep getting this error: ModuleNotFoundError: No module named 'botocore.compress'

I already tried upgrading boto3, but I get the same error. The problem seems to be limited to the pandas path: I'm able to read CSV files with spark.read.format('csv') without issues.

Any suggestions?

1 REPLY

shan_chandra
Esteemed Contributor

Hi @mh_db - you can import the botocore library, or if it is not found, run pip install botocore to resolve this. Alternatively, keep the data in a Spark DataFrame instead of converting it to pandas, and write the CSV from Spark; you can use coalesce(1) to produce a single CSV file (depending on your requirements).
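A minimal sketch of the pandas route once the environment is repaired. The 'botocore.compress' error typically indicates that s3fs (via aiobotocore) and botocore are at incompatible versions, so upgrading them together (e.g. pip install -U s3fs boto3) is the usual fix rather than upgrading boto3 alone; after that, to_csv accepts the s3:// URI directly. The DataFrame contents below are placeholders, not data from the question.

```python
import pandas as pd

# Placeholder summary data standing in for df_summary from the question.
df_summary = pd.DataFrame({"id": [1, 2], "status": ["keep", "exclude"]})

# With a consistent s3fs/botocore install, pandas writes directly to S3:
#   df_summary.to_csv("s3://dataconversion/data/exclude", index=False)
# Writing locally first is a quick way to confirm the DataFrame itself is fine:
df_summary.to_csv("exclude.csv", index=False)

# Round-trip check: the file reads back with the same columns and rows.
check = pd.read_csv("exclude.csv")
print(list(check.columns))
```

If the version conflict persists, the Spark-native write suggested above sidesteps s3fs entirely, since Spark uses its own Hadoop S3 connector rather than botocore.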