<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Handling Binary Files Larger than 2GB in Apache Spark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110229#M43526</link>
    <description>&lt;P&gt;I'm trying to process large binary files (&amp;gt;2GB) in Apache Spark, but I'm running into the following error:&lt;/P&gt;&lt;P&gt;File format: .mf4 &lt;SPAN&gt;(Measurement Data Format)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;org.apache.spark.SparkException: The length of ... is 14749763360, which exceeds the max length allowed: 2147483647.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What are the best approaches to handling large binary files in Spark? Are there any workarounds, such as splitting the file before processing or using a different format?&lt;/P&gt;&lt;P&gt;I would appreciate any insights or best practices.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 14 Feb 2025 13:51:28 GMT</pubDate>
    <dc:creator>pra18</dc:creator>
    <dc:date>2025-02-14T13:51:28Z</dc:date>
    <item>
      <title>Handling Binary Files Larger than 2GB in Apache Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110229#M43526</link>
      <description>&lt;P&gt;I'm trying to process large binary files (&amp;gt;2GB) in Apache Spark, but I'm running into the following error:&lt;/P&gt;&lt;P&gt;File format: .mf4 &lt;SPAN&gt;(Measurement Data Format)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;org.apache.spark.SparkException: The length of ... is 14749763360, which exceeds the max length allowed: 2147483647.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What are the best approaches to handling large binary files in Spark? Are there any workarounds, such as splitting the file before processing or using a different format?&lt;/P&gt;&lt;P&gt;I would appreciate any insights or best practices.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2025 13:51:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110229#M43526</guid>
      <dc:creator>pra18</dc:creator>
      <dc:date>2025-02-14T13:51:28Z</dc:date>
    </item>
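The error in the question arises because Spark's binaryFile data source materializes each file's entire contents in a single BinaryType cell, which is capped at Int.MaxValue bytes (2,147,483,647, i.e. 2 GiB - 1). One pragmatic workaround is to pre-filter the file listing by size before attempting the load, so oversized files can be routed to a different processing path. A minimal stdlib sketch (the function name, the `limit` parameter, and the example paths are illustrative, not part of any Spark API):

```python
import os

# Spark's "binaryFile" source stores each file's bytes in one BinaryType
# cell, capped at Int.MaxValue bytes (2 GiB - 1). Files above this cap
# trigger the "exceeds the max length allowed: 2147483647" error.
MAX_BINARY_LEN = 2_147_483_647

def partition_by_size(paths, limit=MAX_BINARY_LEN):
    """Split paths into (loadable, too_large) relative to the size cap."""
    loadable, too_large = [], []
    for p in paths:
        (loadable if os.path.getsize(p) <= limit else too_large).append(p)
    return loadable, too_large
```

The `loadable` list can then be passed to `spark.read.format("binaryFile").load(...)`, while `too_large` files are handled by splitting or by path-based processing.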
    <item>
      <title>Re: Handling Binary Files Larger than 2GB in Apache Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110340#M43547</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/149245"&gt;@pra18&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;You can split the large binary files into smaller parts before loading them, using the Unix split command like this.&lt;/P&gt;
&lt;DIV class="p-rich_text_block--no-overflow"&gt;ret = os.system("split -b 4020000 -a 4 -d large_data.dat large_data.dat_split_")&lt;/DIV&gt;
</description>
      <pubDate>Sun, 16 Feb 2025 21:54:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110340#M43547</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-16T21:54:17Z</dc:date>
    </item>
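The shell one-liner in the reply above can also be done in pure Python, which is handy when the split needs to run from a notebook rather than a shell. This is a sketch that mirrors `split -b <size> -d`; note, as a caveat, that naive byte-level splits of an .mf4 file may not each be independently parseable, since the format has internal structure (the function and suffix naming are illustrative):

```python
def split_binary(src, chunk_size, suffix_len=4):
    """Split src into numbered parts of at most chunk_size bytes,
    analogous to `split -b chunk_size -a suffix_len -d`."""
    parts = []
    with open(src, "rb") as f:
        idx = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # e.g. large_data.dat_split_0000, large_data.dat_split_0001, ...
            part = f"{src}_split_{idx:0{suffix_len}d}"
            with open(part, "wb") as out:
                out.write(chunk)
            parts.append(part)
            idx += 1
    return parts
```

Concatenating the parts back in order reproduces the original file byte-for-byte.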
    <item>
      <title>Re: Handling Binary Files Larger than 2GB in Apache Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110380#M43555</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for the response. I didn't understand the command you mentioned.&lt;BR /&gt;Here is the context where I'm facing this error:&lt;/P&gt;&lt;P&gt;I have a folder on ADLS Gen2 with many subfolders in the form year/month/date/HH_MM_SS.mf4.&lt;BR /&gt;The file sizes range from 1 GB to 14 GB.&lt;/P&gt;&lt;P&gt;I faced the error when I tried to convert the binary content to a DataFrame.&lt;BR /&gt;Command:&lt;/P&gt;&lt;P&gt;mf4_df = spark.read.format("binaryFile") \&lt;BR /&gt;.option("pathGlobFilter", "*.mf4") \&lt;BR /&gt;.option("recursiveFileLookup", "true") \&lt;BR /&gt;.load("/mnt/adls_data/")&lt;/P&gt;&lt;P&gt;Result : mf4_df:pyspark.sql.connect.dataframe.DataFrame&lt;BR /&gt;path:string&lt;BR /&gt;modificationTime:timestamp&lt;BR /&gt;length:long&lt;BR /&gt;content:binary&lt;BR /&gt;&lt;BR /&gt;Then I used the custom library &lt;EM&gt;"from asammdf import MDF"&lt;/EM&gt; to convert the binary content to a DataFrame.&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 17 Feb 2025 10:38:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110380#M43555</guid>
      <dc:creator>pra18</dc:creator>
      <dc:date>2025-02-17T10:38:13Z</dc:date>
    </item>
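Since asammdf's MDF class can open a file by path, an alternative worth considering (a sketch, not a confirmed Databricks recipe) is to avoid the content:binary column entirely: distribute only the file paths to the workers and let each task open its .mf4 file directly from the mounted location, so no single 2 GiB-capped buffer is ever materialized. Where the raw bytes are still needed, they can be read in bounded chunks; a stdlib helper for that (the function name and the 64 MiB default are illustrative):

```python
def iter_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield a file's bytes in bounded chunks so no single in-memory
    buffer ever approaches Spark's 2 GiB BinaryType cap."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```

In a path-based pipeline, each worker task would call something like `MDF(path)` on its local or mounted path and convert from there, instead of carrying 14 GB blobs through a DataFrame column.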
  </channel>
</rss>

