How to read binary data in pyspark

tourist_on_road — Fri, 13 Dec 2019 00:47:16 GMT

I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.

from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    a = array.array('b')
    a.frombytes(bytes)
    byte_list = a.tolist()
    char_list = [255 + x if x < 0 else x for x in byte_list]
    a.fromlist(char_list)
    return a.tobytes().decode()

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])

The file is hosted on S3. Each record in the file has the first 10 bytes for product_id, followed by the next 4096 bytes for image_features.

I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.
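For reference, here is a minimal sketch of how one record with this layout could be split, assuming the product_id is plain ASCII text so that bytes.decode can be used directly instead of a signed-byte round trip. The names parse_record and ID_BYTES are hypothetical and used only for illustration.

```python
import array

ID_BYTES = 10  # length of the product_id prefix, as described in the post


def parse_record(record: bytes):
    """Split one fixed-length record into (product_id, features)."""
    # Assumption: the ID is ASCII text, so decoding the raw bytes is enough.
    product_id = record[:ID_BYTES].decode("ascii", errors="replace")
    # The remaining bytes are read as 4-byte floats in native byte order.
    features = array.array("f")
    features.frombytes(record[ID_BYTES:])
    return product_id, features.tolist()


# Local check with a small synthetic record (no Spark needed).
sample = b"B000000001" + array.array("f", [0.5, 1.0, 2.0]).tobytes()
print(parse_record(sample))
```

With Spark, the same function could be passed to map on the RDD returned by sc.binaryRecords. One thing worth checking is the record length: if each feature is a 4-byte float, 4096 features occupy 16384 bytes, which would make the record length 10 + 16384 = 16394 rather than 4106.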

Re: How to read binary data in pyspark

shyam_9 — Tue, 17 Dec 2019 06:00:26 GMT

Hi @tourist_on_road, please go through the Spark docs below:

https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
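As a rough illustration of what using binaryFiles would involve: it returns each file as a (path, content) pair with the whole file in content, so fixed-length records would have to be split by hand. The sketch below shows only that splitting step in plain Python; split_records is a hypothetical helper, and for a single large file the binaryRecords call from the question may be the better fit because the whole file would otherwise be held in memory at once.

```python
def split_records(content: bytes, record_length: int):
    """Yield consecutive fixed-length records from a whole-file payload."""
    # Ignore any trailing bytes that do not form a complete record.
    usable = len(content) - len(content) % record_length
    for start in range(0, usable, record_length):
        yield content[start:start + record_length]


# Local check with a tiny synthetic payload (no Spark needed).
payload = b"AAAABBBBCCCCxx"
print(list(split_records(payload, 4)))

# With Spark, this could be applied roughly as (assumed usage, not tested here):
#   sc.binaryFiles(path).flatMap(lambda kv: split_records(kv[1], record_length))
```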
