
How to read binary data in pyspark

tourist_on_road
New Contributor

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    a = array.array('b')
    a.frombytes(bytes)
    byte_list = a.tolist()
    char_list = [255 + x if x < 0 else x for x in byte_list]
    a.fromlist(char_list)
    return a.tobytes().decode()

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The file is hosted on S3. Each record in the file has the first 10 bytes for product_id and the next 4096 bytes as image_features. I'm able to extract all the 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a proper readable format.

1 REPLY

shyam_9
Valued Contributor

Hi @tourist_on_road, please go through the Spark docs below:

https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
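
For readers with the same question, here is a minimal sketch of one way to decode such a record in plain Python. It is not taken from the linked docs. It assumes the layout described in the question (an ID prefix followed by packed floats), that the ID bytes are ASCII/UTF-8 text, and that the floats are stored in the machine's native byte order. The names `parse_record` and `ID_LEN` are hypothetical and used only for illustration.

```python
import array

ID_LEN = 10  # number of product_id bytes at the start of each record, per the question


def parse_record(record: bytes, id_len: int = ID_LEN):
    """Split one fixed-length record into (product_id, features)."""
    # Assumption: the ID is ASCII/UTF-8 text, so the bytes can be decoded
    # directly, without going through a signed-byte array first.
    product_id = record[:id_len].decode("utf-8", errors="replace")

    # Assumption: the remaining bytes are floats in native byte order
    # (type code 'f'). frombytes() raises ValueError if the payload length
    # is not a multiple of the float item size.
    features = array.array("f")
    features.frombytes(record[id_len:])
    return product_id, features.tolist()
```

If this fits the data, it could be applied to the RDD from the question as `img_embedding_file.map(parse_record)`. Note also that type code `'f'` is typically a 4-byte float, so a 4096-byte payload yields 1024 values. If each record actually holds 4096 floats, the record length passed to `sc.binaryRecords` would need to be 10 + 4 * 4096 = 16394 rather than 4106.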
