I currently have an S3 bucket with around 80 tables, each of which uses hive-style partition columns:
S3RootFolder/Table1Name/Year=2024/Month=12/Day=1/xxx.parquet
S3RootFolder/Table1Name/Year=2024/Month=12/Day=2/xxx.parquet
S3RootFolder/Table2Name/Year=2024/Month=12/Day=1/xxx.parquet
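For context, because the layout is hive-style, reading a table's root prefix lets Spark discover Year/Month/Day as partition columns. A minimal sketch of that read (the s3a URL and table name below just mirror the placeholder layout above; the SparkSession line is only needed outside a notebook):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reading the table root lets Spark infer Year, Month, Day from the folder names
df = spark.read.parquet("s3a://S3RootFolder/Table1Name/")
df.printSchema()  # schema includes Year, Month, Day as partition columns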
The total size of the S3 bucket is roughly 3GB, but there are around 110k files across these 80 tables, and a full load using the approach below takes around 20 minutes on a 30GB/16-core driver and 2x 30GB/4-core workers:
from multiprocessing.pool import ThreadPool

def Batch_read(table_name):
    # Read every partition of one table and persist it as a managed table
    df1 = spark.read.parquet(...)
    df1.write.saveAsTable(...)

# Fan the per-table loads out across 8 threads on the driver
pool = ThreadPool(8)
pool.map(Batch_read, table_list)
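(table_list is just the list of top-level table folders in the bucket; something along these lines with boto3 would produce it, with the bucket name as a placeholder and this helper not being part of the original code:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
table_list = []
# Each top-level "folder" (CommonPrefix) under the bucket root is one table
for page in paginator.paginate(Bucket="S3RootFolder", Delimiter="/"):
    for prefix in page.get("CommonPrefixes", []):
        table_list.append(prefix["Prefix"].rstrip("/"))  # e.g. "Table1Name"
)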
Now, 20 minutes for this by itself is not a problem, but the folder is only around 2.5GB with 110k files, and in the future we may have to repeat this migration for anywhere between 20-50TB of data with thousands of tables, at which point this rate will be a problem. Is there a better approach to go about this?