Handling Binary Files Larger than 2GB in Apache Spark
02-14-2025 05:51 AM
I'm trying to process large binary files (>2GB) in Apache Spark, but I'm running into the following error:
File format: .mf4 (Measurement Data Format)
org.apache.spark.SparkException: The length of ... is 14749763360, which exceeds the max length allowed: 2147483647.
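For context on the limit in that message: 2147483647 is Int.MaxValue, the largest index a JVM byte[] (which backs Spark's binary column values) can hold, so a single binary cell can never exceed ~2 GiB. A quick sanity check:

```python
# Sanity check: the "max length allowed" in the error is Int.MaxValue,
# the largest index a JVM byte[] (Spark's binary column storage) can hold.
MAX_BINARY_LEN = 2**31 - 1          # 2147483647 bytes, ~2 GiB
file_len = 14_749_763_360           # length reported in the error
print(MAX_BINARY_LEN)               # 2147483647
print(file_len > MAX_BINARY_LEN)    # True: this file cannot fit in one cell
```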
What are the best approaches to handling large binary files in Spark? Are there any workarounds, such as splitting the file before processing or using a different format?
Would appreciate any insights or best practices.
Thanks!
- Labels:
  - Delta Lake
  - Spark
02-16-2025 01:54 PM
Hi @pra18,
You can split the binary files into smaller chunks before loading them, for example with the `split` command.
02-17-2025 02:37 AM - edited 02-17-2025 02:38 AM
Thank you for the response. I didn't understand the command you mentioned.
Here is the context where I'm facing this error:
I have a folder on ADLS Gen2 with many subfolders laid out as year/month/date/HH_MM_SS.mf4.
The file sizes range from 1 GB to 14 GB.
I hit the error when I tried to convert the binary content to a DataFrame.
Command:
mf4_df = spark.read.format("binaryFile") \
.option("pathGlobFilter", "*.mf4") \
.option("recursiveFileLookup", "true") \
.load("/mnt/adls_data/")
Result: mf4_df: pyspark.sql.connect.dataframe.DataFrame
  path: string
  modificationTime: timestamp
  length: long
  content: binary
Then I used the custom library asammdf ("from asammdf import MDF") to convert the binary content to a DataFrame.
Thanks !

