Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Query on DBFS migration

Harsh1
New Contributor II

We are doing a DBFS migration. The 'user' folder in the root DBFS of the legacy workspace holds 5.8 TB of data. We ran AWS CLI sync/cp from the legacy bucket to the target bucket, and then ran the same from the target bucket to the target DBFS.

Using this technique we migrated the folders under /mnt and /dbfs-root to the target root bucket. While migrating /dbfs-root (user, FileStore, home) we ran into a problem: moving /dbfs/user is very slow.

/user - 5.8TB

/home - 680 GB

/FileStore - 181 GB 

Note: the slowness occurs only when migrating from the target S3 bucket to /dbfs/user.

Status update on /dbfs/user so far:

Data Migration Status - 750 GB / 5.8 TB

Completion Rate ~12.9 %

Data transferred by AWS sync so far: ~403 GB

We are curious because this is happening only for /user, which is moving very slowly, at around 200 GB a day. This was not the case for /home and /FileStore.
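To put the figures above in perspective, here is a rough back-of-the-envelope estimate in Python. It assumes decimal units (1 TB = 1000 GB, which matches the ~12.9 % completion rate quoted above) and that the ~200 GB/day rate stays constant; the function name `migration_eta` is only illustrative.

```python
def migration_eta(total_gb, done_gb, rate_gb_per_day):
    """Return (percent_done, remaining_gb, days_left) for a copy job.

    Assumes the transfer rate stays constant, which may not hold in practice.
    """
    if rate_gb_per_day <= 0:
        raise ValueError("rate_gb_per_day must be positive")
    remaining_gb = total_gb - done_gb
    percent_done = done_gb / total_gb * 100
    days_left = remaining_gb / rate_gb_per_day
    return percent_done, remaining_gb, days_left


# Figures from the post: 5.8 TB total (taken as 5800 GB), 750 GB done, ~200 GB/day.
percent_done, remaining_gb, days_left = migration_eta(5800, 750, 200)
print(f"{percent_done:.1f} % done, {remaining_gb} GB left, ~{days_left:.1f} days at the current rate")
```

At the current rate the remaining data would take roughly 25 more days, which is why a faster approach is needed.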

Please suggest best practices for bringing the /user folder into the target workspace, given this data volume.

Methods already used:

  1. dbutils.fs.cp()
  2. aws s3 sync
  3. aws s3 cp
2 REPLIES

Hubert-Dudek
Esteemed Contributor III

dbutils.fs.cp() and the other dbutils commands will be slow because they use only a single core.
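One possible workaround, sketched here only as an assumption and not tested against this workload, is to split the copy into per-subfolder jobs and run several at once from a thread pool, since copying is mostly I/O-bound. The helper `parallel_copy`, the stub `fake_copy`, and the example paths are all hypothetical; in a Databricks notebook the copy function could wrap `dbutils.fs.cp(src, dst, True)`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_copy(pairs, copy_fn, max_workers=8):
    """Call copy_fn(src, dst) for every (src, dst) pair using a thread pool.

    Returns (succeeded, failed), where failed holds ((src, dst), exception).
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(copy_fn, src, dst): (src, dst) for src, dst in pairs}
        for future in as_completed(futures):
            pair = futures[future]
            try:
                future.result()  # re-raises any error from copy_fn
                succeeded.append(pair)
            except Exception as exc:
                failed.append((pair, exc))
    return succeeded, failed


# Stand-in copy function so the sketch runs anywhere.
# In a notebook this might be: lambda src, dst: dbutils.fs.cp(src, dst, True)
def fake_copy(src, dst):
    print(f"copy {src} -> {dst}")


# Hypothetical bucket name and subfolders, for illustration only.
pairs = [
    ("s3://target-bucket/user/team_a", "dbfs:/user/team_a"),
    ("s3://target-bucket/user/team_b", "dbfs:/user/team_b"),
]
done, errors = parallel_copy(pairs, fake_copy, max_workers=4)
```

Collecting failures instead of stopping at the first error lets the failed subfolders be retried without redoing the ones that succeeded.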

Consider using AWS DataSync: shorturl.at/FNQTV

Harsh1
New Contributor II

Thanks for the quick response.

Regarding the suggested AWS DataSync approach: we have tried DataSync in several ways, but it creates the folders in the S3 bucket itself, not on DBFS, and our task is to copy from the bucket to DBFS.

It seems to support only bucket-level operations, not DBFS-level ones.

Please suggest any best practices or approaches that would meet our needs. That would be a great help. Thanks.
