Databricks notebook takes 2 hours to write to /dbfs/mnt (blob storage), but the same job takes 8 minutes to write to /dbfs/FileStore. I would like to understand why write performance differs between the two.

Surendra
New Contributor III

Problem statement:

  • Source file format: .tar.gz
  • Average size: 10 MB
  • Number of tar.gz files: 1000
  • Each tar.gz file contains around 20000 CSV files.
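To put the numbers above in perspective, here is a rough back-of-the-envelope calculation of the total workload. It uses only the figures from the list and assumes the 10 MB average refers to the compressed archive size.

```python
# Figures taken from the problem statement above.
num_archives = 1000        # number of tar.gz files
csvs_per_archive = 20_000  # approximate CSV files per archive
avg_archive_mb = 10        # average compressed archive size (assumed to be MB per tar.gz)

total_csv_files = num_archives * csvs_per_archive
total_compressed_gb = num_archives * avg_archive_mb / 1000  # decimal GB

print(f"{total_csv_files:,} CSV files")           # 20,000,000 CSV files
print(f"~{total_compressed_gb:g} GB compressed")  # ~10 GB compressed
```

So the job writes on the order of 20 million small files, from roughly 10 GB of compressed input.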

Requirement:

Untar the tar.gz file and write CSV files to blob storage / intermediate storage layer for further processing.

What I have tried:

Untar and write to mount location (screenshot attached):

Here I am using the unTar function from the Hadoop FileUtil library to extract each archive and write the CSV files to the target storage (/dbfs/mnt/, blob storage).

It takes 1.50 hours to complete the job on a cluster with 2 worker nodes (4 cores each).

[Screenshot: databricks_write_to_dbfsMount]

Untar and write to DBFS Root FileStore:

Here I am using the unTar function from the Hadoop FileUtil library to extract each archive and write the CSV files to the target storage (/dbfs/FileStore/). It takes just 8 minutes to complete the job on a cluster with 2 worker nodes (4 cores each).

[Screenshot: databricks_write_to_dbfsMount]
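For reference, both attempts run the same extraction step and differ only in the target directory. The snippet below is a minimal sketch of that step, not the exact notebook code: it uses Python's standard tarfile module as a stand-in for Hadoop's FileUtil unTar, and the function name untar_to_dir and the example paths in the comments are made up for illustration.

```python
import os
import shutil
import tarfile


def untar_to_dir(archive_path: str, target_dir: str) -> int:
    """Extract every regular file from a .tar.gz into target_dir.

    Returns the number of files written. Illustrative stand-in for
    Hadoop's FileUtil unTar, not the code used in the notebook.
    """
    os.makedirs(target_dir, exist_ok=True)
    root = os.path.realpath(target_dir)
    written = 0
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, devices
            dest = os.path.realpath(os.path.join(root, member.name))
            # Skip members whose path would escape the target directory.
            if os.path.commonpath([root, dest]) != root:
                continue
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            # Each member becomes one separate file write on the target.
            with tar.extractfile(member) as src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written += 1
    return written


# Hypothetical usage; only the target directory changes between the two runs:
#   untar_to_dir("/dbfs/FileStore/input/sample.tar.gz", "/dbfs/mnt/output/sample")
#   untar_to_dir("/dbfs/FileStore/input/sample.tar.gz", "/dbfs/FileStore/output/sample")
```

With around 20000 CSV files per archive, the inner loop issues around 20000 individual file writes per archive against whichever target is chosen.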

Questions: 

Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?

What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?