<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Handling Binary Files Larger than 2GB in Apache Spark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110229#M43526</link>
    <description>&lt;P&gt;I'm trying to process large binary files (&amp;gt;2GB) in Apache Spark, but I'm running into the following error:&lt;/P&gt;&lt;P&gt;File format: .mf4 &lt;SPAN&gt;(Measurement Data Format)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;org.apache.spark.SparkException: The length of ... is 14749763360, which exceeds the max length allowed: 2147483647.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What are the best approaches to handling large binary files in Spark? Are there any workarounds, such as splitting the file before processing or using a different format?&lt;/P&gt;&lt;P&gt;I would appreciate any insights or best practices.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Fri, 14 Feb 2025 13:51:28 GMT</pubDate>
    <dc:creator>pra18</dc:creator>
    <dc:date>2025-02-14T13:51:28Z</dc:date>
    <item>
      <title>Handling Binary Files Larger than 2GB in Apache Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110229#M43526</link>
      <description>&lt;P&gt;I'm trying to process large binary files (&amp;gt;2GB) in Apache Spark, but I'm running into the following error:&lt;/P&gt;&lt;P&gt;File format: .mf4 &lt;SPAN&gt;(Measurement Data Format)&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;org.apache.spark.SparkException: The length of ... is 14749763360, which exceeds the max length allowed: 2147483647.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What are the best approaches to handling large binary files in Spark? Are there any workarounds, such as splitting the file before processing or using a different format?&lt;/P&gt;&lt;P&gt;I would appreciate any insights or best practices.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Fri, 14 Feb 2025 13:51:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110229#M43526</guid>
      <dc:creator>pra18</dc:creator>
      <dc:date>2025-02-14T13:51:28Z</dc:date>
    </item>
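The error in the question arises because Spark's binaryFile data source materializes each file's entire contents in a single BinaryType cell, which is capped at Int.MaxValue bytes (2,147,483,647, i.e. 2 GiB - 1). One pragmatic workaround is to pre-filter the file listing by size before attempting the load, so oversized files can be routed to a different processing path. A minimal stdlib sketch (the function name, the `limit` parameter, and the example paths are illustrative, not part of any Spark API):

```python
import os

# Spark's "binaryFile" source stores each file's bytes in one BinaryType
# cell, capped at Int.MaxValue bytes (2 GiB - 1). Files above this cap
# trigger the "exceeds the max length allowed: 2147483647" error.
MAX_BINARY_LEN = 2_147_483_647

def partition_by_size(paths, limit=MAX_BINARY_LEN):
    """Split paths into (loadable, too_large) relative to the size cap."""
    loadable, too_large = [], []
    for p in paths:
        (loadable if os.path.getsize(p) <= limit else too_large).append(p)
    return loadable, too_large
```

The `loadable` list can then be passed to `spark.read.format("binaryFile").load(...)`, while `too_large` files are handled by splitting or by path-based processing.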
    <item>
      <title>Re: Handling Binary Files Larger than 2GB in Apache Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110340#M43547</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/149245"&gt;@pra18&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;You can split the large binary files into smaller parts before loading them, using the Unix split command like this.&lt;/P&gt;
&lt;DIV class="p-rich_text_block--no-overflow"&gt;ret = os.system("split -b 4020000 -a 4 -d large_data.dat large_data.dat_split_")&lt;/DIV&gt;
</description>
      <pubDate>Sun, 16 Feb 2025 21:54:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110340#M43547</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-02-16T21:54:17Z</dc:date>
    </item>
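The shell one-liner in the reply above can also be done in pure Python, which is handy when the split needs to run from a notebook rather than a shell. This is a sketch that mirrors `split -b <size> -d`; note, as a caveat, that naive byte-level splits of an .mf4 file may not each be independently parseable, since the format has internal structure (the function and suffix naming are illustrative):

```python
def split_binary(src, chunk_size, suffix_len=4):
    """Split src into numbered parts of at most chunk_size bytes,
    analogous to `split -b chunk_size -a suffix_len -d`."""
    parts = []
    with open(src, "rb") as f:
        idx = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # e.g. large_data.dat_split_0000, large_data.dat_split_0001, ...
            part = f"{src}_split_{idx:0{suffix_len}d}"
            with open(part, "wb") as out:
                out.write(chunk)
            parts.append(part)
            idx += 1
    return parts
```

Concatenating the parts back in order reproduces the original file byte-for-byte.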
    <item>
      <title>Re: Handling Binary Files Larger than 2GB in Apache Spark</title>
      <link>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110380#M43555</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for the response. I didn't understand the command you mentioned.&lt;BR /&gt;Here is the context where I'm facing this error:&lt;/P&gt;&lt;P&gt;I have a folder on ADLS Gen2 with many subfolders in the form year/month/date/HH_MM_SS.mf4.&lt;BR /&gt;The file sizes range from 1 GB to 14 GB.&lt;/P&gt;&lt;P&gt;I faced the error when I tried to convert the binary content to a DataFrame.&lt;BR /&gt;Command:&lt;/P&gt;&lt;P&gt;mf4_df = spark.read.format("binaryFile") \&lt;BR /&gt;.option("pathGlobFilter", "*.mf4") \&lt;BR /&gt;.option("recursiveFileLookup", "true") \&lt;BR /&gt;.load("/mnt/adls_data/")&lt;/P&gt;&lt;P&gt;Result : mf4_df:pyspark.sql.connect.dataframe.DataFrame&lt;BR /&gt;path:string&lt;BR /&gt;modificationTime:timestamp&lt;BR /&gt;length:long&lt;BR /&gt;content:binary&lt;BR /&gt;&lt;BR /&gt;Then I used the custom library &lt;EM&gt;"from asammdf import MDF"&lt;/EM&gt; to convert the binary content to a DataFrame.&lt;BR /&gt;&lt;BR /&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 17 Feb 2025 10:38:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handling-binary-files-larger-than-2gb-in-apache-spark/m-p/110380#M43555</guid>
      <dc:creator>pra18</dc:creator>
      <dc:date>2025-02-17T10:38:13Z</dc:date>
    </item>
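Since asammdf's MDF class can open a file by path, an alternative worth considering (a sketch, not a confirmed Databricks recipe) is to avoid the content:binary column entirely: distribute only the file paths to the workers and let each task open its .mf4 file directly from the mounted location, so no single 2 GiB-capped buffer is ever materialized. Where the raw bytes are still needed, they can be read in bounded chunks; a stdlib helper for that (the function name and the 64 MiB default are illustrative):

```python
def iter_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield a file's bytes in bounded chunks so no single in-memory
    buffer ever approaches Spark's 2 GiB BinaryType cap."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```

In a path-based pipeline, each worker task would call something like `MDF(path)` on its local or mounted path and convert from there, instead of carrying 14 GB blobs through a DataFrame column.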
  </channel>
</rss>

