
Can I move a single file larger than 100GB using dbutils.fs?

himanmon
New Contributor III

Hello. I have a file over 100GB. Sometimes it is on the cluster's local path, and sometimes it is on a volume. I want to copy it to another path on the volume, or to an S3 bucket.

 
This is my code:

dbutils.fs.cp('file:///tmp/test.txt', '/Volumes/catalog/schema/path/')

However, when I try to copy a file exceeding 100GB, the following error occurs:

IllegalArgumentException: partNumber must be between 1 and 10000 inclusive, but is 10001

From what I found, dbutils (Spark) splits files into 10 MB (10,485,760-byte) parts when transferring them to a volume or S3. The error seems to occur because 10,485,760 bytes × 10,000 parts = 104,857,600,000 bytes (about 97.7 GiB) is the most that fits, so my file needs more than 10,000 parts.

So I set spark.hadoop.fs.s3a.multipart.size to 104857600 (100 MB).

[screenshot: cluster Spark configuration with spark.hadoop.fs.s3a.multipart.size set to 104857600]
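For reference, here is a rough way to check which value is actually in effect from a notebook (a sketch; spark.hadoop.* entries in the cluster Spark config are passed to Hadoop with the prefix stripped, so the key to look up is fs.s3a.multipart.size):

# Sketch: verify the effective S3A multipart size from a notebook.
# The spark.hadoop. prefix is removed before the value reaches the Hadoop
# configuration, so we look up fs.s3a.multipart.size directly.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.s3a.multipart.size"))  # expected: 104857600 after a restart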

However, when I run dbutils.fs.cp, it still seems to write 10 MB parts, and the same error occurs again.
I can tell because 10 MB files are continuously being created in '/tmp/hadoop-root/s3a'.

[screenshot: 10 MB part files accumulating under /tmp/hadoop-root/s3a]
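To show the part sizes being staged, this is the kind of check I used (a sketch that simply walks the /tmp/hadoop-root/s3a path mentioned above on the driver and prints file sizes):

import os

# Sketch: walk the S3A buffer directory on the driver and print each file's size.
buffer_dir = "/tmp/hadoop-root/s3a"
for root, _dirs, files in os.walk(buffer_dir):
    for name in files:
        path = os.path.join(root, name)
        print(f"{path}: {os.path.getsize(path):,} bytes")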

 

Am I misunderstanding something?

Or is it simply not possible to move a file larger than 100GB with dbutils?

2 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @himanmon ,

This is caused by S3's limit on multipart upload part count: parts can only be numbered from 1 to 10,000.

After setting spark.hadoop.fs.s3a.multipart.size to 104857600, did you restart the cluster? The setting only takes effect after a cluster restart.

Also, before sending the file, you could try compressing it with gzip.
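A minimal sketch of the gzip idea (reusing the local path from your post; test.txt.gz is just an example name, and this only helps if the data compresses well):

import gzip
import shutil

# Sketch: compress the local file first, then copy the smaller archive with dbutils.
src = "/tmp/test.txt"
dst = "/tmp/test.txt.gz"

with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

dbutils.fs.cp(f"file://{dst}", "/Volumes/catalog/schema/path/test.txt.gz")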

himanmon
New Contributor III

Hi @szymon_dybczak , thank you for your answer.
Of course, I restarted the cluster. However, it still uses 10 MB parts.
I understand that compression could be an option, but that doesn't solve the problem when even the compressed file exceeds 100GB.
