04-22-2022 01:38 AM
Problem statement:
Requirement:
Untar the tar.gz file and write CSV files to blob storage / intermediate storage layer for further processing.
What I have tried:
Untar and write to mount location (attached screenshot):
Here I am using the unTar function from the Hadoop FileUtil library to untar the archive and write the CSV files to the target storage (/dbfs/mnt/ - blob storage).
It takes 1.5 hours to complete the job on a cluster with 2 worker nodes (4 cores each).
Untar and write to DBFS Root FileStore:
Here I am using the unTar function from the Hadoop FileUtil library to untar the archive and write the CSV files to the target storage (/dbfs/FileStore/).
It takes just 8 minutes to complete the job on a cluster with 2 worker nodes (4 cores each).
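For reference, the extract-and-write step can be sketched in Python. This is not the Hadoop FileUtil.unTar call described above: it swaps in the standard-library tarfile module, and the function name untar_csvs is hypothetical. It assumes the target is reachable as an ordinary file path, and it flattens the archive layout, so members with the same file name would overwrite each other.

```python
import tarfile
import time
from pathlib import Path


def untar_csvs(archive_path, target_dir):
    """Extract the .csv members of a tar.gz archive into target_dir.

    Returns (written_paths, elapsed_seconds).
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    start = time.perf_counter()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Iterating the archive streams through it member by member.
        for member in archive:
            # Skip directories, links, and anything that is not a CSV file.
            if not member.isfile() or not member.name.lower().endswith(".csv"):
                continue
            source = archive.extractfile(member)
            # Flatten the archive layout: keep only the file name.
            destination = target / Path(member.name).name
            with source, open(destination, "wb") as out:
                # Copy in 1 MiB chunks so a large CSV never sits fully in memory.
                while chunk := source.read(1024 * 1024):
                    out.write(chunk)
            written.append(destination)
    return written, time.perf_counter() - start
```

Calling it once with a /dbfs/mnt/ target and once with a /dbfs/FileStore/ target, and comparing the returned elapsed times, would mirror the two runs described above.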
Questions:
Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?
What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?
04-25-2022 10:49 AM
It is about routing. Traffic that stays on the local network inside the region is very fast, and traffic that leaves the local network but stays in the same area is still fast. Writes to storage in another region are much slower, especially when the traffic goes over the public internet.
A slowdown of around 13x is about what I would expect. In addition, traffic routed outside the local network incurs outbound traffic charges.
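A rough back-of-the-envelope model shows why distance can dominate. If files are written one after another, every storage request pays at least one network round trip, so total time grows with the number of requests times the round-trip time, plus the bytes transferred divided by bandwidth. The sketch below implements only that model. The function name and every number in it (file count, requests per file, round-trip times, bandwidth) are hypothetical placeholders, not measurements from this thread.

```python
def estimate_write_seconds(n_files, requests_per_file, round_trip_s,
                           total_bytes, bandwidth_bytes_per_s):
    """Rough estimate for writing files sequentially to remote storage."""
    # Latency term: each request waits for at least one round trip.
    latency_cost = n_files * requests_per_file * round_trip_s
    # Throughput term: time to push the payload through the link.
    transfer_cost = total_bytes / bandwidth_bytes_per_s
    return latency_cost + transfer_cost


# Hypothetical workload: 20,000 small CSV files, 3 requests each, 2 GB total.
workload = dict(n_files=20_000, requests_per_file=3, total_bytes=2e9)

# Hypothetical network figures for the two placements.
same_region = estimate_write_seconds(
    round_trip_s=0.002, bandwidth_bytes_per_s=200e6, **workload)
cross_region = estimate_write_seconds(
    round_trip_s=0.080, bandwidth_bytes_per_s=50e6, **workload)

print(f"same region:  {same_region:.0f} s")
print(f"cross region: {cross_region:.0f} s")
print(f"slowdown:     {cross_region / same_region:.1f}x")
```

The resulting ratio depends entirely on the inputs and is not meant to reproduce the 13x figure. The point is that the latency term scales with the number of requests, so a job that writes many small files is hit hardest when the storage is far away.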
04-22-2022 02:27 AM
@Surendranatha Reddy Chappidi, this looks like a problem with the /dbfs/mnt mount, that is, with the blob storage configuration:
Here I explained how to add ADLS2 and a private link: https://community.databricks.com/s/feed/0D53f00001eQGOHCA4.
04-25-2022 06:33 AM
@Hubert Dudek Thanks for your suggestions.
After creating a storage account in the same region as Databricks, I can see that performance is as expected.
It is now clear that the issue was the /mnt/ location being in a different region from Databricks.
I would like to understand why it takes 13x longer to write data to a storage account in a different region than to one in the same region.
Which API or protocol does Databricks use in the backend to write data to same-region and different-region storage?
I am concerned because we are developing a service for customers.
Customers can choose the storage account region and the Databricks account region when deploying this service in their subscription.
If the two regions differ, the customer will face the performance issues I reported earlier.
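One way to guard against this at deployment time is a simple pre-flight comparison of the two region settings. The sketch below is an illustration only: the function names are hypothetical, and how the two region values are obtained from the customer's deployment is outside its scope. It normalizes the names so that a display name such as "West Europe" compares equal to the short name "westeurope".

```python
def normalize_region(name):
    """Reduce 'West Europe' or 'westeurope' to 'westeurope'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def check_same_region(workspace_region, storage_region):
    """Return (ok, message) for a pre-deployment region check."""
    if normalize_region(workspace_region) == normalize_region(storage_region):
        return True, "Workspace and storage account are in the same region."
    return False, (
        f"Workspace region '{workspace_region}' differs from storage region "
        f"'{storage_region}'; expect slower writes and outbound traffic charges."
    )
```

A deployment script could surface the message as a warning rather than a hard failure, so the customer can still choose a cross-region setup knowingly.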
@Kaniz Fatma Kindly help me understand why it takes 13x longer to write data to a storage account in a different region than to one in the same region.