<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Very large binary files ingestion error when using binaryFile reader in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/m-p/47440#M5929</link>
    <description>&lt;P&gt;Hello, I am facing an error while trying to read a large binary file (rosbag format) with the binaryFile reader. The file I am trying to read is approximately 7 GB. Here is the error message I am getting:&lt;/P&gt;&lt;PRE&gt;FileReadException: Error while reading file dbfs:/mnt/0-landingzone/tla/7a0cb35d-b606-4a9e-890b-83fc385f78ca.bag. Caused by: SparkException: The length of dbfs:/mnt/0-landingzone/tla/7a0cb35d-b606-4a9e-890b-83fc385f78ca.bag is 7156086862, which exceeds the max length allowed: 2147483647.&lt;/PRE&gt;&lt;P&gt;Here is the code:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, LongType, BinaryType
)

BINARY_FILES_SCHEMA = StructType(
    [
        StructField("path", StringType()),
        StructField("modificationTime", TimestampType()),
        StructField("length", LongType()),
        StructField("content", BinaryType()),
    ]
)

binary_df = (
    spark.read.format("binaryFile")
    .schema(BINARY_FILES_SCHEMA)
    .load("/mnt/0-landingzone/tla/7a0cb35d-b606-4a9e-890b-83fc385f78ca.bag")
)
binary_df.printSchema()
display(binary_df)&lt;/PRE&gt;&lt;P&gt;Is there a way to read such large files in Databricks?&lt;/P&gt;</description>
    <pubDate>Mon, 02 Oct 2023 08:21:16 GMT</pubDate>
    <dc:creator>eva_mcmf</dc:creator>
    <dc:date>2023-10-02T08:21:16Z</dc:date>
    <item>
      <title>Very large binary files ingestion error when using binaryFile reader</title>
      <link>https://community.databricks.com/t5/get-started-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/m-p/47440#M5929</link>
      <description>&lt;P&gt;Hello, I am facing an error while trying to read a large binary file (rosbag format) with the binaryFile reader. The file I am trying to read is approximately 7 GB. Here is the error message I am getting:&lt;/P&gt;&lt;PRE&gt;FileReadException: Error while reading file dbfs:/mnt/0-landingzone/tla/7a0cb35d-b606-4a9e-890b-83fc385f78ca.bag. Caused by: SparkException: The length of dbfs:/mnt/0-landingzone/tla/7a0cb35d-b606-4a9e-890b-83fc385f78ca.bag is 7156086862, which exceeds the max length allowed: 2147483647.&lt;/PRE&gt;&lt;P&gt;Here is the code:&lt;/P&gt;&lt;PRE&gt;from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, LongType, BinaryType
)

BINARY_FILES_SCHEMA = StructType(
    [
        StructField("path", StringType()),
        StructField("modificationTime", TimestampType()),
        StructField("length", LongType()),
        StructField("content", BinaryType()),
    ]
)

binary_df = (
    spark.read.format("binaryFile")
    .schema(BINARY_FILES_SCHEMA)
    .load("/mnt/0-landingzone/tla/7a0cb35d-b606-4a9e-890b-83fc385f78ca.bag")
)
binary_df.printSchema()
display(binary_df)&lt;/PRE&gt;&lt;P&gt;Is there a way to read such large files in Databricks?&lt;/P&gt;</description>
      <pubDate>Mon, 02 Oct 2023 08:21:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/very-large-binary-files-ingestion-error-when-using-binaryfile/m-p/47440#M5929</guid>
      <dc:creator>eva_mcmf</dc:creator>
      <dc:date>2023-10-02T08:21:16Z</dc:date>
    </item>
  </channel>
</rss>