04-22-2022 01:38 AM
Problem statement:
- Source file format : .tar.gz
- Average size: 10 MB
- Number of tar.gz files: 1000
- Each tar.gz file contains around 20000 CSV files.
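For scale, here is a quick back-of-the-envelope calculation from the numbers above. It is only a rough sketch: it assumes decimal megabytes and that the stated averages hold across all archives, and the variable names are just for illustration.

```python
# Rough workload size, using only the figures stated in the problem statement.
NUM_ARCHIVES = 1000          # number of tar.gz files
CSVS_PER_ARCHIVE = 20_000    # approximate CSV files per archive
AVG_ARCHIVE_MB = 10          # average compressed archive size (decimal MB assumed)

total_csv_files = NUM_ARCHIVES * CSVS_PER_ARCHIVE
total_compressed_gb = NUM_ARCHIVES * AVG_ARCHIVE_MB / 1000
avg_compressed_bytes_per_csv = AVG_ARCHIVE_MB * 1_000_000 / CSVS_PER_ARCHIVE

print(f"{total_csv_files:,} CSV files in total")
print(f"~{total_compressed_gb:.0f} GB of compressed input in total")
print(f"~{avg_compressed_bytes_per_csv:.0f} compressed bytes per CSV on average")
```

So the job writes on the order of 20 million very small files, rather than a large volume of data.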
Requirement :
Untar each tar.gz file and write its CSV files to blob storage (or an intermediate storage layer) for further processing.
What I have tried:
Untar and write to the mount location (Attached Screenshot):
Here I am using the unTar function from the Hadoop FileUtil library to untar the archives and write the CSV files to the target storage (/dbfs/mnt/, which is blob storage).
The job takes 1.50 hours to complete on a cluster with 2 worker nodes (4 cores each).
Untar and write to the DBFS root FileStore:
Here I am using the same unTar function from the Hadoop FileUtil library to untar the archives and write the CSV files to the target storage (/dbfs/FileStore/). The job takes just 8 minutes to complete on a cluster with 2 worker nodes (4 cores each).
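To make the comparison easier to reproduce, here is a minimal sketch of the same untar-and-write step with a timer around it. Note that this is not my actual job: it swaps in Python's standard `tarfile` module instead of Hadoop's FileUtil unTar, and the helper names `untar_to_dir` and `timed_untar` are hypothetical. It also assumes that file names are unique within an archive, because it flattens any folder structure.

```python
import shutil
import tarfile
import time
from pathlib import Path


def untar_to_dir(archive_path: str, target_dir: str) -> int:
    """Extract every regular file from a .tar.gz archive into target_dir.

    Returns the number of files written.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            # Keep only the base name (assumption: names are unique per archive).
            # This also prevents entries from escaping target_dir.
            name = Path(member.name).name
            src = tar.extractfile(member)
            with src, open(target / name, "wb") as dst:
                shutil.copyfileobj(src, dst)
            count += 1
    return count


def timed_untar(archive_path: str, target_dir: str) -> tuple[int, float]:
    """Run untar_to_dir and return (files written, elapsed seconds)."""
    start = time.perf_counter()
    written = untar_to_dir(archive_path, target_dir)
    return written, time.perf_counter() - start
```

Calling `timed_untar` once with a target under /dbfs/mnt/ and once with a target under /dbfs/FileStore/, using the same archive, should show the per-archive difference between the two storage locations.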
Questions:
Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?
What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?