
Databricks notebook is taking 2 hours to write to /dbfs/mnt (blob storage). Same job is taking 8 minutes to write to /dbfs/FileStore. I would like to understand why write performance is different in both cases.

Surendra
New Contributor III

Problem statement:

  • Source file format: .tar.gz
  • Average size: 10 MB
  • Number of tar.gz files: 1000
  • Each tar.gz file contains around 20000 CSV files.
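
For a sense of scale, the figures in the list above can be multiplied out. This is only a back-of-the-envelope sketch: the variable names are illustrative, the inputs are the approximate values stated in the list, and decimal units (1 MB = 1,000,000 bytes) are assumed.

```python
# Approximate workload size implied by the problem statement.
num_archives = 1000          # number of tar.gz files
csvs_per_archive = 20_000    # "around 20000" CSV files per archive
avg_archive_mb = 10          # average compressed archive size in MB

# Total number of CSV files the job has to write to the target storage.
total_csv_files = num_archives * csvs_per_archive

# Total compressed input, assuming 1 GB = 1000 MB.
total_compressed_gb = num_archives * avg_archive_mb / 1000

# Average compressed bytes per CSV, assuming 1 MB = 1,000,000 bytes.
avg_compressed_bytes_per_csv = avg_archive_mb * 1_000_000 / csvs_per_archive

print(f"{total_csv_files:,} CSV files in total")
print(f"about {total_compressed_gb:.0f} GB of compressed input")
print(f"about {avg_compressed_bytes_per_csv:.0f} compressed bytes per CSV")
```

The job is therefore a very large number of very small file writes rather than a few large ones.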

Requirement:

Untar each tar.gz file and write the CSV files to blob storage or an intermediate storage layer for further processing.

What I have tried:

Untar and write to mount location (attached screenshot):

Here I am using the unTar function from the Hadoop FileUtil library to untar the archives and write the CSV files to the target storage (/dbfs/mnt/, blob storage).

It takes 1.50 hours to complete the job on a cluster with 2 worker nodes (4 cores each).

[Screenshot: databricks_write_to_dbfsMount]

Untar and write to DBFS Root FileStore:

Here I am using the same unTar function from the Hadoop FileUtil library to untar the archives and write the CSV files to the target storage (/dbfs/FileStore/). It takes just 8 minutes to complete the job on a cluster with 2 worker nodes (4 cores each).

[Screenshot: databricks_write_to_dbfsMount]
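
The two runs differ only in the target directory. A minimal sketch of that pattern is shown here, using Python's standard `tarfile` module as a stand-in for Hadoop's FileUtil unTar. The function name `untar_to` and the paths in the closing comment are illustrative assumptions, not the code from the screenshots.

```python
import tarfile
from pathlib import Path


def untar_to(archive_path: str, target_dir: str) -> int:
    """Extract all regular files from a .tar.gz archive into target_dir.

    Returns the number of files written. Only use this on archives
    from a trusted source, since member paths are not sanitised here.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = 0
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip directories, links and other non-file entries.
            if not member.isfile():
                continue
            archive.extract(member, path=target)
            written += 1
    return written


# Hypothetical usage: same archive, two different targets.
# untar_to("/dbfs/<source>/sample.tar.gz", "/dbfs/mnt/<mount>/out")
# untar_to("/dbfs/<source>/sample.tar.gz", "/dbfs/FileStore/out")
```

Every extracted member becomes a separate write against the target file system, so the cost of a single small write to that target is what drives the total run time.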

Questions: 

Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?

What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

It comes down to routing. Traffic that stays on the local network inside the region is very fast. Traffic that leaves the local network but stays in the same region is still fast. Traffic to another region is much slower, especially when it goes over the public internet.

A slowdown of around 13x is what I would expect. In addition, traffic routed outside the local network incurs outbound traffic charges.

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

@Surendranatha Reddy Chappidi, it seems that the problem is with the /dbfs/mnt mount, that is, with the blob storage configuration:

  • the blob storage needs to be in the same availability zone as your Databricks workspace,
  • please use a private link so that traffic is routed locally rather than over the internet (the network already has a private subnet used by Databricks, and there should be one more for the remote endpoints),
  • please upgrade the blob storage to ADLS2.

Here I explained how to add ADLS2 and a private link: https://community.databricks.com/s/feed/0D53f00001eQGOHCA4.

Kaniz_Fatma
Community Manager

Hi @Surendranatha Reddy Chappidi, please let us know if @Hubert Dudek's answer helps, or we'll find another solution for you.

Surendra
New Contributor III

@Hubert Dudek Thanks for your suggestions.

After creating a storage account in the same region as Databricks, I can see that performance is as expected.

It is now clear that the issue was the /mnt/ location being in a different region from Databricks.

I would like to understand why it takes 13x longer to write data to a storage account in a different region than to one in the same region.

What API or protocol does Databricks use in the backend to write data to same-region and different-region storage?

I am concerned because we are developing a service for customers.

Customers can choose the storage account region and the Databricks account region when deploying this service in their subscription.

If the two regions differ, the customer will face the performance issue I reported earlier.
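
A deployment-time guard for this case could compare the two regions before anything is provisioned. The following is only a sketch under stated assumptions: the function name `check_same_region` and the example region strings are hypothetical, and how the regions are read from the customer's subscription is left out.

```python
def check_same_region(workspace_region: str, storage_region: str) -> bool:
    """Return True when both region names match after normalisation."""

    def normalise(region: str) -> str:
        # Azure regions appear both as display names ("West Europe")
        # and as programmatic names ("westeurope").
        return region.replace(" ", "").lower()

    return normalise(workspace_region) == normalise(storage_region)


# Hypothetical values chosen by a customer at deployment time.
if not check_same_region("West Europe", "eastus"):
    print("Warning: storage account and Databricks workspace are in "
          "different regions; expect slower writes.")
```

Whether such a mismatch should only warn or should block the deployment is a product decision.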

@Kaniz Fatma Kindly help me understand why it takes 13x longer to write data to a storage account in a different region than to one in the same region.

Hubert-Dudek
Esteemed Contributor III

It comes down to routing. Traffic that stays on the local network inside the region is very fast. Traffic that leaves the local network but stays in the same region is still fast. Traffic to another region is much slower, especially when it goes over the public internet.

A slowdown of around 13x is what I would expect. In addition, traffic routed outside the local network incurs outbound traffic charges.
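
One plausible way to read this for a workload made of many small files: if each file costs at least one round trip to the storage endpoint, total time scales with per-file latency rather than with bandwidth. The sketch below is a deliberately simplified model, not a description of how Databricks writes data. The function name is hypothetical and the two latency values are placeholders, not measurements.

```python
def estimated_write_seconds(num_files: int,
                            per_file_latency_s: float,
                            parallel_writers: int) -> float:
    """Latency-bound estimate: each writer handles one file at a time."""
    return num_files * per_file_latency_s / parallel_writers


num_files = 20_000_000   # 1000 archives x about 20000 CSV files each
writers = 8              # 2 worker nodes x 4 cores

# Placeholder per-file latencies (assumptions, not measured values).
near_latency_s = 0.001
far_latency_s = 0.013

near = estimated_write_seconds(num_files, near_latency_s, writers)
far = estimated_write_seconds(num_files, far_latency_s, writers)

# In this model the slowdown equals the ratio of the two latencies,
# regardless of the number of files or writers.
slowdown = far / near
print(f"slowdown factor: {slowdown:.1f}x")
```

In this model, adding workers shortens both runs by the same factor, so the ratio between them stays the same.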

Kaniz_Fatma
Community Manager

Hi @Hubert Dudek, I just wanted to thank you. We’re so lucky to have customers like you!

The way you are helping our community is incredible.
