How to read binary data in pyspark

tourist_on_road
New Contributor

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    a = array.array('b')
    a.frombytes(bytes)
    byte_list = a.tolist()
    char_list = [255 + x if x < 0 else x for x in byte_list]
    a.fromlist(char_list)
    return a.tobytes().decode()

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The file is hosted on S3. Each record stores the product_id in its first 10 bytes and the image_features in the next 4096 bytes. I can extract all of the image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.
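
For reference, here is a minimal sketch of parsing one record under the layout described above. It assumes the 10-byte product ID is plain ASCII text (not confirmed in this thread) and that the remaining bytes are native-order 32-bit floats, as the `array.array('f')` call in the question implies. `parse_record` and `ID_LENGTH` are hypothetical names used only for illustration.

```python
import array

ID_LENGTH = 10  # bytes holding the product ID, per the layout described above


def parse_record(record):
    """Split one fixed-length record into (product_id, features)."""
    # Assumption: the ID bytes are ASCII text, so they can be decoded
    # directly without going through a signed-byte array first.
    product_id = bytes(record[:ID_LENGTH]).decode("ascii")

    # The remaining bytes are read as native-order 32-bit floats.
    features = array.array("f")
    features.frombytes(bytes(record[ID_LENGTH:]))
    return product_id, features.tolist()


# Hypothetical use with the RDD from the question (needs a live SparkContext):
# decoded_embeddings = img_embedding_file.map(parse_record)
```

Note that 4096 bytes hold only 1024 four-byte floats. If each record actually carries 4096 floats, the record length passed to `binaryRecords` would need to be 10 + 4 * 4096 = 16394 rather than 4106.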

shyam_9
Databricks Employee

Hi @tourist_on_road, please go through the Spark docs below:

https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
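
The linked `SparkContext.binaryFiles` method returns each file whole as a `(path, content)` pair rather than as fixed-length records, so the content would have to be sliced by hand. A rough sketch of that slicing follows; `split_records` is a hypothetical helper, and the record length of 4106 is simply the value used in the question.

```python
def split_records(content, record_length):
    """Yield consecutive fixed-length slices of one file's bytes."""
    # A trailing partial record, if any, is dropped.
    usable = len(content) - len(content) % record_length
    for start in range(0, usable, record_length):
        yield content[start:start + record_length]


# Hypothetical use (needs a live SparkContext `sc`):
# records = (sc.binaryFiles("s3://bucket/image_features.b")
#              .flatMap(lambda kv: split_records(kv[1], 4106)))
```

For a single large file this approach loads the whole file into one task, so `binaryRecords`, as already used in the question, may remain the better fit for fixed-length records.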