Data migration from S3 to Databricks
2 weeks ago
I currently have an S3 bucket with around 80 tables, each of which uses Hive-style partition columns:
S3RootFolder/Table1Name/Year=2024/Month=12/Day=1/xxx.parquet
S3RootFolder/Table1Name/Year=2024/Month=12/Day=2/xxx.parquet
S3RootFolder/Table2Name/Year=2024/Month=12/Day=1/xxx.parquet
The total size of the S3 bucket is roughly 3 GB, but there are around 110k files across these 80 tables, and a full load using the approach below takes around 20 minutes on a 30 GB / 16-core driver and 2x 30 GB / 4-core workers:
from multiprocessing.pool import ThreadPool

def Batch_read(table_name):
    df1 = spark.read.parquet(...)     # read all partition files for this table
    df1.write.saveAsTable(...)        # register it as a managed table

pool = ThreadPool(8)                  # load 8 tables concurrently
pool.map(Batch_read, table_list)
Now, 20 minutes for this is not a problem by itself, but the folder is only around 2.5 GB with 110k files, and in the future we may have to repeat this migration for anywhere between 20-50 TB of data with thousands of tables, at which point this rate will be a problem. Is there a better approach?
2 weeks ago
Hi DBStudent,
How are you doing today? As I understand it, you're absolutely right to be thinking about this now. Your current load time isn't bad for 3 GB; the real problem is the huge number of small files (110k!), not the data size itself. When you scale to 20-50 TB, this same approach will become a bottleneck. Spark slows down a lot when it has to list and manage tons of tiny files, because it spends more time on file metadata than on processing data. A good fix is to compact the files first by reading them in and writing them back out as fewer, larger files using coalesce(). You could also look into Auto Loader, which is built for large file counts and can help with future scaling. Also, think about adjusting your cluster setup or parallelizing the work by table across Workflows when you go bigger. A rough compaction sketch is below, and let me know if you'd like a fuller migration setup; I'd be happy to help!
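For example, a minimal compaction sketch could look like the following. The staging location and the coalesce target of 8 files per table are just placeholders you'd tune for your own data:

from multiprocessing.pool import ThreadPool

def compact_table(table_name):
    # Read all the small partition files for one table in a single pass,
    # then rewrite them as a handful of larger files in a staging location.
    df = spark.read.parquet(f"s3://S3RootFolder/{table_name}/")
    (df.coalesce(8)  # target file count per table is a placeholder
       .write.mode("overwrite")
       .parquet(f"s3://compacted-staging/{table_name}/"))  # hypothetical staging bucket

pool = ThreadPool(8)  # compact 8 tables concurrently, like your current load
pool.map(compact_table, table_list)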
Regards,
Brahma
2 weeks ago
Hi Brahmareddy, unfortunately compacting the files before reading them is not an option at this point, and I believe that coalescing when writing will not affect the bottleneck of opening and reading the 110k files. I have also tried setting the Databricks Spark optimizedWrite configuration to True, and that did not help significantly. Please do suggest any sample setups that you think could help, though.
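(For reference, what I enabled was roughly the following; I'm going from memory, so the exact property name may differ slightly:)

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")  # optimized writes on the Delta side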
2 weeks ago
Hi DBStudent,
You’re absolutely right: coalescing during the write won’t help with the initial bottleneck, since the issue is really on the read side, where Spark has to list and open 110k small files. If you can’t pre-compact them before reading, one thing that could help is increasing file-level parallelism by lowering spark.sql.files.maxPartitionBytes (e.g., to 8 MB) and also tweaking spark.sql.files.openCostInBytes so that Spark packs the small files into tasks more aggressively. Another option is to raise the file-listing parallelism by setting spark.databricks.optimize.maxFileListingParallelism to a higher value. Also, if you haven’t already, try increasing the number of workers rather than just beefing up the cores per node; this spreads out the file scanning better. If you have a list of known table paths, consider parallelizing the load using Workflows or even multiple jobs, so each job picks up a batch of tables. A rough sketch of the settings and batching is below, and I can help draft a setup using Workflows + table batches if that sounds useful!
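Here is what I have in mind for the read-side settings and the table batching. The values and batch size are just starting points to experiment with, and split_into_batches is a hypothetical helper, not a Databricks API:

# Control how Spark packs the many small files into read tasks (values are starting points, not recommendations).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(8 * 1024 * 1024))  # ~8 MB per read partition
spark.conf.set("spark.sql.files.openCostInBytes", str(1 * 1024 * 1024))    # treat each file open as ~1 MB of extra cost

# Split the table list into batches so each Workflow task or job loads its own slice of tables.
def split_into_batches(tables, batch_size=10):
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

for batch in split_into_batches(table_list):
    for table_name in batch:
        Batch_read(table_name)  # or hand each batch to its own Workflow task / job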
Regards,
Brahma

