How to read binary data in pyspark
12-12-2019 04:47 PM
I'm reading the binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using PySpark.
from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    a = array.array('b')
    a.frombytes(bytes)
    byte_list = a.tolist()
    char_list = [255 + x if x < 0 else x for x in byte_list]
    a.fromlist(char_list)
    return a.tobytes().decode()

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])
The file is hosted on S3. Each record in the file has the first 10 bytes for product_id and the next 4096 bytes for image_features.
I'm able to extract all 4096 image features, but I'm having trouble reading the first 10 bytes and converting them into a readable format.
Labels: Binary file, Parser, Read data
12-16-2019 10:00 PM
Hi @tourist_on_road, please go through the Spark docs below:
https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles
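For the product_id part specifically, a possible approach is to skip the signed-byte round trip and decode the first 10 bytes directly. An array.array('b') only holds values from -128 to 127, so appending values above 127 with fromlist would likely raise an OverflowError (and the signed-to-unsigned offset would be 256, not 255). The sketch below is a minimal illustration, not a tested solution for this dataset: the helper name parse_record, the ASCII encoding of the ID, the native byte order of the floats, and the 10 + 4096 byte layout (taken from the question) are all assumptions.

```python
import array

ID_BYTES = 10          # product_id width, as stated in the question
FEATURE_BYTES = 4096   # feature payload width, as stated in the question

def parse_record(record: bytes):
    """Split one fixed-length record into (product_id, features).

    Hypothetical helper: assumes an ASCII product_id and float32
    features stored in the machine's native byte order.
    """
    # The slice is already a bytes object, so it can be decoded directly.
    # errors="replace" keeps the job running if a byte is not valid ASCII.
    product_id = record[:ID_BYTES].decode("ascii", errors="replace")

    features = array.array("f")
    features.frombytes(record[ID_BYTES:ID_BYTES + FEATURE_BYTES])
    return product_id, features.tolist()

# Local check on a synthetic record (made-up ID and values, no Spark needed).
sample_features = array.array("f", [0.5] * (FEATURE_BYTES // 4))
sample = b"B000123456" + sample_features.tobytes()
product_id, features = parse_record(sample)
```

With Spark, the same function could be applied as sc.binaryRecords(path, ID_BYTES + FEATURE_BYTES).map(parse_record). Note that 4096 bytes hold only 1024 float32 values; if each record actually carries 4096 floats, FEATURE_BYTES would need to be 4096 * 4 = 16384 and the record length 16394.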