How to read binary data in pyspark - Databricks Community - 27417

Register to join the community

Data Engineering

Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I'm reading binary file http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b using pyspark.

from io importStringIO import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b",4106)def mapper(features): a = array.array('f') a.frombytes(features)return a.tolist()def byte_mapper(bytes): a = array.array('b') a.frombytes(bytes) byte_list = a.tolist() char_list =[255+x if x <0else x for x in byte_list] a.fromlist(char_list)return a.tobytes().decode()

decoded_embeddings = img_embedding_file.map(lambda x:[byte_mapper(x[:10]), mapper(x[10:])])

The file is hosted on s3. The file in each row has first 10 bytes for

product_id

next 4096 bytes as

image_features

I'm able to extract all the 4096 image features but facing issue when reading the first 10 bytes and converting it into proper readable format.

1 REPLY 1

Hi @tourist_on_road, please go through the below spark docs,

https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles

never-displayed

You must be signed in to add attachments

never-displayed

Announcements

PSA: Community Edition retires at the end of 2025 - move to Free Edition today to keep your work.

🎤 Call for Presentations: Data + AI Summit 2026 is Open!

Last Chance: Help Shape the 2026 Data + AI Summit | Win a Full Conference Pass

🌟 Community Pulse: Your Weekly Roundup! December 05 – 11, 2025

Jaipur Usergroup First Virtual Meetup: AI/BI Genie + Data Science Careers — 19 Dec | 6 PM IST