
Can I move a single file larger than 100GB using dbutils.fs?

himanmon
New Contributor III

Hello. I have a file larger than 100GB. Sometimes it is on the cluster's local path, and sometimes it is on a volume.
I want to copy it to another path on the volume, or to an S3 bucket.

 
dbutils.fs.cp('file:///tmp/test.txt', '/Volumes/catalog/schema/path/')
This is my code.

However, when I try to copy a file larger than 100GB, the following error occurs:

IllegalArgumentException: partNumber must be between 1 and 10000 inclusive, but is 10001

From what I have found, dbutils (Spark) splits a file into 10MB parts (10,485,760 bytes) when transferring it to a volume or to S3. The error seems to occur because a 100GB file needs more than 10,000 parts of that size.

So I set spark.hadoop.fs.s3a.multipart.size to 104857600 (100 MiB).
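
To make the arithmetic behind this concrete, here is a minimal sketch in plain Python (no Spark or S3 calls). The helper names are made up for illustration, and the 10,000-part cap is taken from the error message above. It only shows what the numbers would be if the configured part size were actually used, which is exactly what is in doubt here.

```python
# Illustrative arithmetic only: these helpers are hypothetical and do not
# call Spark, dbutils, or S3.

MAX_PARTS = 10_000  # cap reported by the error: partNumber must be 1..10000


def parts_needed(file_size_bytes: int, part_size_bytes: int) -> int:
    """Number of parts required to upload a file (ceiling division)."""
    return -(-file_size_bytes // part_size_bytes)


def max_file_size(part_size_bytes: int, max_parts: int = MAX_PARTS) -> int:
    """Largest file that fits within the part cap for a given part size."""
    return part_size_bytes * max_parts


def min_part_size(file_size_bytes: int, max_parts: int = MAX_PARTS) -> int:
    """Smallest part size that keeps a file within the part cap."""
    return -(-file_size_bytes // max_parts)


file_size = 100 * 1024**3  # a file of exactly 100 GiB, as an example

print(parts_needed(file_size, 10_485_760))   # 10240 parts at 10 MiB: over the cap
print(parts_needed(file_size, 104_857_600))  # 1024 parts at 100 MiB: within the cap
print(max_file_size(10_485_760))             # 104857600000 bytes is the 10 MiB ceiling
print(min_part_size(file_size))              # smallest workable part size in bytes
```

With 10 MiB parts the ceiling is 104,857,600,000 bytes (about 97.7 GiB), which matches the point at which part number 10001 would be requested.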

However, when I run dbutils.fs.cp, it still seems to generate 10MB parts, and the same error occurs again.
I can tell because 10MB files keep being created in '/tmp/hadoop-root/s3a'.


Am I misunderstanding something?

Or is it simply not possible to move a file larger than 100GB with dbutils?

2 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @himanmon ,

This is caused by the S3 limit on the number of parts in a multipart upload. Parts can only be numbered from 1 to 10,000.

After setting spark.hadoop.fs.s3a.multipart.size to 104857600, did you restart the cluster? The setting only takes effect after the cluster is restarted.

Also, before sending the file, you can try compressing it with gzip.
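
For the compression idea, a minimal sketch using only the Python standard library might look like this. The function name, chunk size, and paths are hypothetical. It streams the file in chunks, so a very large input is never held in memory. Whether the result ends up small enough to stay under the part limit depends entirely on how compressible the data is.

```python
import gzip
import shutil


def gzip_file(src_path: str, dst_path: str, chunk_size: int = 16 * 1024 * 1024) -> str:
    """Compress src_path into dst_path with gzip, streaming chunk by chunk.

    Hypothetical helper for illustration; returns the destination path.
    """
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so memory use stays flat regardless of file size.
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

On the cluster this could be called as `gzip_file('/tmp/test.txt', '/tmp/test.txt.gz')`, followed by the same `dbutils.fs.cp` call on `file:///tmp/test.txt.gz`.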

himanmon
New Contributor III

Hi @szymon_dybczak, thank you for your answer.
Of course, I restarted the cluster. However, it still uses 10MB parts.
I understand that compression could be an option, but that does not solve the case where the compressed file still exceeds 100GB.
