
Data migration from S3 to Databricks

DBStudent
New Contributor II

I currently have an S3 bucket with around 80 tables, each of which uses Hive-style partition columns:

 

S3RootFolder/Table1Name/Year=2024/Month=12/Day=1/xxx.parquet

S3RootFolder/Table1Name/Year=2024/Month=12/Day=2/xxx.parquet

S3RootFolder/Table2Name/Year=2024/Month=12/Day=1/xxx.parquet

 

The total size of the S3 bucket is roughly 3 GB, but there are around 110k files across these 80 tables. A full load using the approach below takes around 20 minutes on a 30 GB / 16-core driver with two 30 GB / 4-core workers.

from multiprocessing.pool import ThreadPool

def Batch_read(table_name):
    # Read one table's partitioned Parquet folder and persist it as a table
    df1 = spark.read.parquet(...)
    df1.write.saveAsTable(...)

pool = ThreadPool(8)
pool.map(Batch_read, table_list)

Now, 20 minutes is not a problem by itself, but the folder is only around 2.5 GB with 110k files, and in the future we may have to repeat this migration for anywhere between 20-50 TB of data with thousands of tables, at which point this rate will be a problem. Is there a better approach?

3 REPLIES

Brahmareddy
Honored Contributor III

Hi DBStudent,

How are you doing today? As I understand it, you're absolutely right to be thinking about this now. Your current load time isn't bad for 3 GB; the real problem is the huge number of small files (110k!), not the data size itself. When you scale to 20-50 TB, the same approach will become a bottleneck, because Spark slows down a lot when it has to list and manage tons of tiny files: it spends more time on file metadata than on processing data. A good fix is to compact the files first by reading them in and writing them back out as fewer, larger files using coalesce(). You could also look into Auto Loader, which is built for large file counts and can help with future scaling. Also think about adjusting your cluster setup or parallelizing the work by table across Workflows when you go bigger. Let me know if you'd like a sample compaction or migration setup; I'd be happy to help!
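For example, a minimal compaction sketch could look like the snippet below. The compacted output path and the coalesce(8) target are placeholders, so adjust them to your bucket layout and table sizes:

def compact_table(table_name):
    # Read the table's many small Parquet files once (path is a placeholder)
    df = spark.read.parquet(f"s3://S3RootFolder/{table_name}/")
    # Rewrite them as a handful of larger files before loading into Databricks
    (df.coalesce(8)
       .write
       .mode("overwrite")
       .parquet(f"s3://S3CompactedFolder/{table_name}/"))

You could then fan this out across tables with the same ThreadPool pattern you already use, e.g. pool.map(compact_table, table_list).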

Regards,

Brahma

Hi Brahmareddy, unfortunately compacting the files before reading them is not an option at this point, and I believe that coalescing when writing will not address the bottleneck of opening and reading the 110k files. I have also tried setting the Databricks Spark optimized-write configuration to true, and that did not help significantly. Please do suggest any sample setups you think might help, though.
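For clarity, the setting I enabled was along these lines (exact flag name quoted from memory, so treat it as approximate):

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # write-side file compaction; it did not change the read/listing time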

Brahmareddy
Honored Contributor III

Hi DBStudent,

You're absolutely right that coalescing during the write won't help with the initial bottleneck, since the issue is really on the read side, where Spark has to list and open 110k small files. If you can't pre-compact them before reading, one thing that could help is increasing file-level parallelism by lowering spark.sql.files.maxPartitionBytes (e.g., to 8 MB) and also tweaking spark.sql.files.openCostInBytes so Spark spreads the small files across more tasks. Another option is to raise the file-listing parallelism by setting spark.databricks.optimize.maxFileListingParallelism to a higher value. Also, if you haven't already, try increasing the number of executors (worker nodes) instead of just beefing up cores per node; this spreads the file scanning out better. If you have a list of known table paths, consider parallelizing the load using Workflows or even multiple jobs, so each job picks up a batch of tables, as shown in the sketch below. I can help draft a setup using Workflows + table batches if that sounds useful!
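A rough sketch of that read-side tuning is below. The values are illustrative only, and the Databricks-specific listing option should be verified against your runtime version:

# Illustrative values only; benchmark against your own file sizes.
spark.conf.set("spark.sql.files.maxPartitionBytes", 8 * 1024 * 1024)    # smaller max split size -> more read tasks
spark.conf.set("spark.sql.files.openCostInBytes", 8 * 1024 * 1024)      # higher open cost -> fewer tiny files packed into one task
spark.conf.set("spark.databricks.optimize.maxFileListingParallelism", 200)  # listing parallelism; confirm the exact name for your DBR version

# Hypothetical table batching: split the table list so separate jobs or Workflow tasks each take a slice
batches = [table_list[i::4] for i in range(4)]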

Regards,

Brahma