Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Can I move a single file larger than 100GB using dbutils fs?

himanmon
New Contributor III

Hello. I have a file over 100GB. Sometimes it is on the cluster's local path, and sometimes it is on a volume.
I want to copy it to another path on the volume, or to an S3 bucket.

 
This is my code:

dbutils.fs.cp('file:///tmp/test.txt', '/Volumes/catalog/schema/path/')

However, when I try to copy a file exceeding 100GB, the following error occurs:

IllegalArgumentException: partNumber must be between 1 and 10000 inclusive, but is 10001

According to what I found, dbutils (Spark) splits files into 10 MB (10,485,760-byte) blocks when transferring them to a volume or S3. The error seems to occur because my file needs more than 10,000 of these 10 MB blocks; with 10 MB parts, anything above roughly 104.9 GB hits that limit.
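For a quick sanity check of that arithmetic (a rough sketch; the 110GB size below is just a hypothetical example):

import math

PART_SIZE = 10 * 1024 * 1024       # 10 MiB = 10,485,760 bytes, the block size observed above
MAX_PARTS = 10000                  # S3 allows part numbers 1..10000

print(PART_SIZE * MAX_PARTS)       # 104,857,600,000 bytes ~= 104.9 GB (97.7 GiB) upper bound

file_size = 110 * 1000**3          # hypothetical 110 GB file
print(math.ceil(file_size / PART_SIZE))   # 10,491 parts -> exceeds 10,000, hence "partNumber ... is 10001"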

So I set spark.hadoop.fs.s3a.multipart.size to 104857600.
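For reference, a minimal sketch of the two places this property can be set (the notebook variant is shown only as an illustration and is not guaranteed to reach an S3A client that is already initialized):

# Option 1: cluster-level Spark config (takes effect after a cluster restart):
#   spark.hadoop.fs.s3a.multipart.size 104857600

# Option 2: from a notebook, write directly into the Hadoop configuration.
# `spark` is the SparkSession that Databricks notebooks predefine; an S3A FileSystem
# that is already cached may keep using the old value.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.multipart.size", "104857600")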


However, when I run dbutils.fs.cp, it still seems to upload in 10 MB blocks, and the same error occurs again. I can tell because 10 MB part files keep being created under '/tmp/hadoop-root/s3a'.
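A minimal way to check which value the cluster is actually using (assuming the spark session predefined in a Databricks notebook):

# Read back the effective S3A settings from the Hadoop configuration.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.s3a.multipart.size"))        # should print 104857600 if the setting was picked up
print(hadoop_conf.get("fs.s3a.multipart.threshold"))   # size above which uploads switch to multipart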


 

Am I misunderstanding something?

Or is it simply not possible to move a file larger than 100GB with dbutils?
2 REPLIES

szymon_dybczak
Contributor III

Hi @himanmon ,

This is caused by S3's limit on multipart upload part count: parts can only be numbered from 1 to 10,000.
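As a rough illustration (the 110GB file size here is hypothetical), the part size must be at least the file size divided by 10,000, which is why raising it to about 100 MB gives plenty of headroom:

import math

MAX_PARTS = 10000
file_size = 110 * 1000**3                     # hypothetical 110 GB file
print(math.ceil(file_size / MAX_PARTS))       # 11,000,000 bytes -> at least ~11 MB per part

part_size = 104857600                         # the 100 MiB value you set
print(part_size * MAX_PARTS / 1000**4)        # ~1.05 TB ceiling with 100 MiB parts (S3 caps a single object at 5 TB)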

After setting spark.hadoop.fs.s3a.multipart.size to 104857600, did you restart the cluster? The setting only takes effect after a restart.

Also, before sending the file you could try compressing it with gzip.
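A minimal sketch of that compression step (standard library only, reusing the paths from your post):

import gzip
import shutil

# Stream-compress the local file first, so the upload has fewer bytes to split into parts.
with open("/tmp/test.txt", "rb") as f_in, gzip.open("/tmp/test.txt.gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

dbutils.fs.cp("file:///tmp/test.txt.gz", "/Volumes/catalog/schema/path/")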

himanmon
New Contributor III

Hi @szymon_dybczak, thank you for your answer.
Of course, I restarted the cluster. However, it still uploads in 10 MB blocks.
I understand that compression could be an option, but that doesn't help if the compressed file still exceeds 100GB.
