04-22-2022 01:38 AM
Problem statement:
- Source file format: .tar.gz
- Average size: 10 MB
- Number of tar.gz files: 1000
- Each tar.gz file contains around 20,000 CSV files.
Requirement:
Untar each tar.gz file and write the CSV files to blob storage / an intermediate storage layer for further processing.
What I have tried:
Untar and write to the mount location (screenshot attached):
Here I use the unTar function from the Hadoop FileUtil library to untar the archive and write the CSV files to the target storage (/dbfs/mnt/, blob storage).
The job takes 1.5 hours to complete on a cluster with 2 worker nodes (4 cores each).
Untar and write to the DBFS root FileStore:
Here I use the same unTar function from the Hadoop FileUtil library to untar the archive and write the CSV files to the target storage (/dbfs/FileStore/). The job takes just 8 minutes to complete on a cluster with 2 worker nodes (4 cores each).
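For anyone who wants to reproduce the comparison, here is a minimal sketch that extracts one archive and times it. It is an assumption-laden stand-in, not the code from the screenshot: it uses Python's standard `tarfile` module instead of Hadoop's FileUtil.unTar, and the function name `untar_to` and the example paths in the comments are hypothetical.

```python
import tarfile
import time
from pathlib import Path
from typing import Tuple


def untar_to(archive_path: str, target_dir: str) -> Tuple[int, float]:
    """Extract the regular files of a .tar.gz archive into target_dir.

    Returns (number of files extracted, elapsed seconds).
    Note: this uses Python's tarfile, not Hadoop FileUtil.unTar.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    start = time.perf_counter()
    with tarfile.open(archive_path, "r:gz") as archive:
        # Keep only regular files; parent directories are created as needed.
        members = [m for m in archive.getmembers() if m.isfile()]
        archive.extractall(path=target, members=members)
    elapsed = time.perf_counter() - start
    return len(members), elapsed


# Hypothetical usage on a cluster (paths are placeholders, not from the post):
#   untar_to("/dbfs/mnt/<source>/sample.tar.gz", "/dbfs/mnt/<target>/")
#   untar_to("/dbfs/mnt/<source>/sample.tar.gz", "/dbfs/FileStore/<target>/")
```

Running it once per target location on the same archive gives a like-for-like timing of the two storage paths.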
Questions:
Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?
What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?
Accepted Solutions
04-25-2022 10:49 AM
It is about routing. Traffic that stays on the local network inside the region is very fast. Traffic that leaves the local network but stays in the same area is still fast. Traffic that has to reach another region is much slower, especially when it goes over the public internet.
A slowdown of around 13x is what I would expect. Additionally, traffic routed outside the local network incurs outbound traffic charges.
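To make the routing point concrete, here is a rough back-of-the-envelope sketch. It assumes a simple model that is not stated anywhere in this thread: each small file costs a fixed number of network round trips, and writers work in parallel. The function name and the round-trip values are hypothetical illustrations, not measurements.

```python
def estimated_write_seconds(
    n_files: int,
    round_trip_ms: float,
    requests_per_file: int = 1,
    parallel_writers: int = 1,
) -> float:
    """Rough lower bound on wall-clock seconds for writing many small files.

    Assumed model: every file costs `requests_per_file` round trips and the
    work is spread evenly over `parallel_writers`. Throughput limits and
    service-side overhead are ignored.
    """
    if parallel_writers < 1:
        raise ValueError("parallel_writers must be at least 1")
    total_round_trips = n_files * requests_per_file
    return total_round_trips * (round_trip_ms / 1000.0) / parallel_writers


# Illustrative numbers only: 20,000 files per archive, 8 writers
# (2 workers x 4 cores), and made-up round-trip times for each case.
same_region = estimated_write_seconds(20_000, round_trip_ms=2.0, parallel_writers=8)
cross_region = estimated_write_seconds(20_000, round_trip_ms=40.0, parallel_writers=8)
slowdown = cross_region / same_region
```

In this model the slowdown is simply the ratio of the round-trip times, which is why many tiny files suffer far more from cross-region routing than a few large ones.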
04-22-2022 02:27 AM
@Surendranatha Reddy Chappidi, it seems that the problem is with the /dbfs/mnt mount, that is, with the blob storage configuration:
- The blob storage needs to be in the same availability zone as your Databricks workspace.
- Please use a private link so traffic is routed locally, not through the internet (in the network there is a private subnet used by Databricks, and there should be one more for remote endpoints).
- Please upgrade the blob storage to ADLS2.
Here I explained how to add ADLS2 and a private link: https://community.databricks.com/s/feed/0D53f00001eQGOHCA4.
04-25-2022 06:33 AM
@Hubert Dudek Thanks for your suggestions.
After creating a storage account in the same region as Databricks, I can see that performance is as expected.
It is now clear that the issue is that the /mnt/ location is in a different region than Databricks.
I would like to understand why it takes 13x more time to write data to storage in a different region compared to a storage account in the same region.
Which API or protocol does Databricks use in the backend to write data to storage in the same region and in a different region?
I am concerned because we are developing a service for customers.
Customers can choose the storage account region and the Databricks account region when deploying this service in their subscription.
If the two regions differ, the customer will face the performance issues I reported earlier.
@Kaniz Fatma Kindly help me understand why it takes 13x more time to write data to storage in a different region compared to a storage account in the same region.
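One way a service like this could guard against the mismatch is a pre-deployment check on the two regions the customer picks. The sketch below is only an illustration: the function names are hypothetical, and it assumes region names arrive either as display names such as "West Europe" or as short names such as "westeurope".

```python
def normalize_region(name: str) -> str:
    """Lowercase a region name and drop whitespace, e.g. 'West Europe' -> 'westeurope'."""
    return "".join(name.lower().split())


def same_region(workspace_region: str, storage_region: str) -> bool:
    """Return True when both region names refer to the same region after normalization."""
    return normalize_region(workspace_region) == normalize_region(storage_region)


def region_warning(workspace_region: str, storage_region: str) -> str:
    """Return a warning message for mismatched regions, or an empty string."""
    if same_region(workspace_region, storage_region):
        return ""
    return (
        "Storage account region '%s' differs from Databricks region '%s'; "
        "expect slower writes and outbound traffic charges."
        % (storage_region, workspace_region)
    )
```

A deployment script could call `region_warning(...)` with the customer's two choices and surface the message before anything is provisioned.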

