<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to read binary data in pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-read-binary-data-in-pyspark/m-p/27418#M19292</link>
    <description>&lt;P&gt;Hi @tourist_on_road, please go through the Spark docs below:&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles" target="_blank"&gt;https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 17 Dec 2019 06:00:26 GMT</pubDate>
    <dc:creator>shyam_9</dc:creator>
    <dc:date>2019-12-17T06:00:26Z</dc:date>
    <item>
      <title>How to read binary data in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-binary-data-in-pyspark/m-p/27417#M19291</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I'm reading the binary file &lt;A href="http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b" target="test_blank"&gt;http://snap.stanford.edu/data/amazon/productGraph/image_features/image_features.b&lt;/A&gt; using PySpark.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from io import StringIO
import array

img_embedding_file = sc.binaryRecords("s3://bucket/image_features.b", 4106)

def mapper(features):
    a = array.array('f')
    a.frombytes(features)
    return a.tolist()

def byte_mapper(bytes):
    a = array.array('b')
    a.frombytes(bytes)
    byte_list = a.tolist()
    char_list = [255 + x if x &amp;lt; 0 else x for x in byte_list]
    a.fromlist(char_list)
    return a.tobytes().decode()

decoded_embeddings = img_embedding_file.map(lambda x: [byte_mapper(x[:10]), mapper(x[10:])])&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The file is hosted on S3. Each record in the file has the first 10 bytes for &lt;PRE&gt;&lt;CODE&gt;product_id&lt;/CODE&gt;&lt;/PRE&gt; and the next 4096 bytes as &lt;PRE&gt;&lt;CODE&gt;image_features&lt;/CODE&gt;&lt;/PRE&gt;. I'm able to extract all 4096 image features, but I'm facing an issue when reading the first 10 bytes and converting them into a proper readable format.&lt;/P&gt;
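&lt;P&gt;For reference, the record layout described above (a 10-byte id followed by float data) could be split as in the minimal sketch below. It is not a confirmed fix. The names ID_LEN, N_FLOATS, RECORD_LEN and parse_record are hypothetical, and the sketch assumes the id is plain ASCII and that the features are native-endian 4-byte floats, as array.array('f') implies. Under the further assumption that each record holds 4096 float values rather than 4096 bytes, the record length would be 10 + 4 * 4096 = 16394 instead of 4106; that should be checked against the dataset's own documentation.&lt;/P&gt;

```python
import array

ID_LEN = 10      # bytes of product_id at the start of each record (from the post)
N_FLOATS = 4096  # assumption: 4096 float32 values per record, not 4096 bytes
RECORD_LEN = ID_LEN + 4 * N_FLOATS  # 16394 under that assumption


def parse_record(record, id_len=ID_LEN):
    """Split one fixed-length record into (product_id, features)."""
    # Assumption: the id bytes are ASCII text, so they can be decoded
    # directly without a signed-byte round trip through array('b').
    product_id = bytes(record[:id_len]).decode("ascii", errors="replace")
    # Native-endian 4-byte floats, the same typecode used in the post.
    features = array.array("f")
    features.frombytes(bytes(record[id_len:]))
    return product_id, features.tolist()


# Local check on a synthetic record; no Spark session is needed here.
sample = b"B000123456" + array.array("f", [0.5, 1.5, -2.0]).tobytes()
sample_id, sample_features = parse_record(sample)
```

&lt;P&gt;With Spark, the same function could be applied as sc.binaryRecords(path, RECORD_LEN).map(parse_record). If the record length passed to binaryRecords does not match the real record size, every record after the first starts at the wrong offset, so the leading 10 bytes would no longer line up with an id.&lt;/P&gt;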
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 13 Dec 2019 00:47:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-binary-data-in-pyspark/m-p/27417#M19291</guid>
      <dc:creator>tourist_on_road</dc:creator>
      <dc:date>2019-12-13T00:47:16Z</dc:date>
    </item>
    <item>
      <title>Re: How to read binary data in pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-binary-data-in-pyspark/m-p/27418#M19292</link>
      <description>&lt;P&gt;Hi @tourist_on_road, please go through the Spark docs below:&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles" target="_blank"&gt;https://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.SparkContext.binaryFiles&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Dec 2019 06:00:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-binary-data-in-pyspark/m-p/27418#M19292</guid>
      <dc:creator>shyam_9</dc:creator>
      <dc:date>2019-12-17T06:00:26Z</dc:date>
    </item>
  </channel>
</rss>

