Databricks notebook is taking 2 hours to write to ...

Surendra · 04-22-2022

Problem statement:

Source file format: .tar.gz
Average size: 10 MB
Number of tar.gz files: 1000
Each tar.gz file contains around 20,000 CSV files.

Requirement :

Untar the tar.gz file and write CSV files to blob storage / intermediate storage layer for further processing.

What I have tried:

Untar and write to the mount location (screenshot attached):

Here I am using the unTar function from the Hadoop FileUtil library to untar the archive and write the CSV files to the target storage (/dbfs/mnt/, blob storage).

It takes 1.50 hours to complete the job on a cluster with 2 worker nodes (4 cores each).

Untar and write to DBFS Root FileStore:

Here I am using the same unTar function from the Hadoop FileUtil library to untar the archive and write the CSV files to the target storage (/dbfs/FileStore/). It takes just 8 minutes to complete the job on a cluster with 2 worker nodes (4 cores each).
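For anyone who wants to reproduce the comparison, here is a minimal sketch of the same untar-and-write step with a timer around it. It is not the notebook code from the screenshot: it uses Python's standard tarfile module as a stand-in for Hadoop's FileUtil unTar, and the function name untar_and_time is hypothetical. The only thing that changes between the two attempts is the target directory.

```python
import shutil
import tarfile
import time
from pathlib import Path


def untar_and_time(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir.

    Returns (number_of_files_written, elapsed_seconds).
    Hypothetical helper: a stdlib stand-in for Hadoop's FileUtil unTar.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    target = target.resolve()

    start = time.perf_counter()
    file_count = 0
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            destination = (target / member.name).resolve()
            if target not in destination.parents:
                continue  # ignore entries that would land outside target_dir
            destination.parent.mkdir(parents=True, exist_ok=True)
            # Each CSV is written as its own file, one write per archive entry
            with archive.extractfile(member) as source, open(destination, "wb") as out:
                shutil.copyfileobj(source, out)
            file_count += 1
    elapsed = time.perf_counter() - start
    return file_count, elapsed
```

Calling it once with a directory under /dbfs/mnt/ and once with a directory under /dbfs/FileStore/ (the exact subdirectories are up to you) on the same archive gives a per-archive timing for each target, which should make the difference easier to isolate from the rest of the job.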

Questions:

Why is writing to DBFS/FileStore or DBFS/databricks/driver 15 times faster than writing to DBFS/mnt storage?

What storage and file system does the DBFS root (/FileStore, /databricks-datasets, /databricks/driver) use in the backend? What is the size limit for each subfolder?

The Databricks notebook takes 2 hours to write to /dbfs/mnt (blob storage). The same job takes 8 minutes to write to /dbfs/FileStore. I would like to understand why the write performance differs between the two cases.