How would I upload a file stream object to an S3 bucket using PySpark?
12-01-2022 06:51 AM
I am able to save data to S3 using PySpark, but I am not sure how to save a file stream object to an S3 bucket using PySpark. I could do this with plain Python, but once Unity Catalog was enabled on Databricks, it always ends with an access denied exception.
I have added a screenshot and sample code here for reference.
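%python
import requests
import boto3

# Assumption: `response` comes from a requests call; the URL is a placeholder
response = requests.get("https://example.com/file")
response.raise_for_status()

s3_client = boto3.client('s3')
# Pass the response bytes themselves, not the string literal "response.content"
r = s3_client.put_object(Body=response.content, Bucket="bucketName", Key="fileName")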
My questions are:
Why am I able to save data to an Amazon S3 bucket using PySpark but not with Python?
If saving data works with PySpark, why does it not work with Python? If there is a reason for this, how can we save a file using PySpark?
12-01-2022 07:12 PM
Hi @rammy,
The access denied exception is most likely a permissions problem.
Make sure the storage attached to your Unity Catalog metastore has the right set of permissions (storage credentials) to allow writing objects to S3. Also check that your IAM role has the required policy attached.
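As a rough way to check which IAM identity the plain Python code is running as, something like the sketch below could help. It assumes the notebook's boto3 credentials may differ from whatever Spark uses for its writes, which is worth verifying rather than taking as given. `caller_arn` is a hypothetical helper name. It expects a client created with `boto3.client("sts")`.

```python
def caller_arn(sts_client):
    """Return the ARN of the identity behind the current AWS credentials.

    `sts_client` is assumed to be created with boto3.client("sts").
    """
    # get_caller_identity returns a dict with "UserId", "Account" and "Arn"
    identity = sts_client.get_caller_identity()
    return identity["Arn"]


# Usage (requires boto3 and AWS credentials on the cluster):
#   import boto3
#   print(caller_arn(boto3.client("sts")))
```

If the ARN printed is not the role that carries the S3 write policy, that would be consistent with the access denied error.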
12-01-2022 11:10 PM
Thanks for your response @SENTHIL KUMARR MALLI SUDARSAN. I will try it out. If this configuration is required, why does PySpark allow writing to S3?
12-01-2022 11:14 PM
Make sure you are not writing to the storage bucket linked to the Unity Catalog metastore. Also try writing to a different bucket and share the exception you get. Please also tell us which region the metastore bucket is in and which region the bucket you are writing to is in.
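A minimal sketch for collecting that information, assuming an S3 client created with `boto3.client("s3")`. `bucket_region` and `write_error_code` are hypothetical helper names, and the test write creates a small object under whatever key you pass in.

```python
def bucket_region(s3_client, bucket):
    """Return the region name S3 reports for a bucket."""
    resp = s3_client.get_bucket_location(Bucket=bucket)
    # S3 returns None as the LocationConstraint for buckets in us-east-1
    return resp.get("LocationConstraint") or "us-east-1"


def write_error_code(s3_client, bucket, key):
    """Attempt a small test write; return None on success, else an error code."""
    try:
        s3_client.put_object(Body=b"probe", Bucket=bucket, Key=key)
        return None
    except Exception as exc:
        # botocore's ClientError exposes the service error under exc.response
        response = getattr(exc, "response", None)
        error = response.get("Error", {}) if isinstance(response, dict) else {}
        # Fall back to the exception class name if no service code is present
        return error.get("Code", type(exc).__name__)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   print(bucket_region(s3, "bucketName"))
#   print(write_error_code(s3, "bucketName", "probe.txt"))
```

Running both helpers against the metastore bucket and the target bucket gives the regions and the exact error code to post back here.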
12-02-2022 10:25 AM
I learned that a change is required in Unity Catalog to make this work with Python, and I was advised to use PySpark to store the file in S3.
I do not see much information anywhere about storing a file stream object in an S3 bucket. Can anyone share an example?
By the way, does PySpark support saving files in .docx format?
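For reference, here is a minimal sketch of uploading a file stream with plain boto3 rather than PySpark, assuming the permissions issue discussed above has been resolved. `upload_stream` is a hypothetical helper name. It relies on boto3's `upload_fileobj`, which accepts any binary file-like object. S3 stores the bytes as given, so the same call should work for a .docx payload.

```python
import io


def upload_stream(s3_client, data, bucket, key):
    """Wrap raw bytes in a file-like object and stream them to S3.

    `s3_client` is assumed to be created with boto3.client("s3").
    """
    stream = io.BytesIO(data)
    # upload_fileobj reads from the stream in binary mode until it is exhausted
    s3_client.upload_fileobj(stream, bucket, key)


# Usage (requires boto3, requests, and AWS credentials; URL is a placeholder):
#   import boto3, requests
#   response = requests.get("https://example.com/report.docx")
#   upload_stream(boto3.client("s3"), response.content, "bucketName", "report.docx")
```

This does not go through Spark at all, so it is subject to the same IAM permissions as the `put_object` call in the original post.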