How to get the total directory size using dbutils

gtaspark
New Contributor II

Is there a way to get the size of a directory in ADLS (Gen2) using dbutils in Databricks?

If I run this

dbutils.fs.ls("/mnt/abc/xyz")

I get the sizes of the individual files inside the xyz folder (there are about 5,000 files), but I want the total size of the xyz folder itself.

How can I achieve this? Any help is appreciated.

shyam_9
Databricks Employee

Hi @gtaspark,

Please use the size command to get the size, as described in the docs below:

https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command

There is no size command, and ls returns 0 for directories.

UmakanthSingalr
New Contributor II

@gtaspark

%scala

val path="/mnt/abc/xyz"

val filelist=dbutils.fs.ls(path)

val df = filelist.toDF()

df.createOrReplaceTempView("adlsSize")

spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()

Breitenberg
New Contributor II

I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files. I want the total size of all the files and everything else inside XYZ.

I can list all the folders inside a particular path, but I want their combined size. I also see that

display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))

gives me the size of the abc file, but I want the complete size of XYZ.

Hari_Gopinath
Databricks Employee

Hi,

You can use the Unix disk usage command (du) in a notebook to get the size. As you may know, every DBFS directory is also mounted on the local Unix file system, where you can access it under /dbfs.

%sh du -h /dbfs/mnt/abc/xyz
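
The same total can also be computed from Python through the local /dbfs mount, which avoids parsing the du output. This is only a minimal sketch, assuming the mount is reachable at /dbfs on the driver; the function name directory_size_bytes is made up for illustration:

```python
import os

def directory_size_bytes(local_path: str) -> int:
    """Sum the sizes of all regular files under local_path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(local_path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip broken symlinks and anything that is not a regular file.
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total
```

In a notebook, directory_size_bytes("/dbfs/mnt/abc/xyz") / 1024**3 would give the size in GiB. Note that this sums file lengths, whereas du reports disk blocks used by default, so the two numbers can differ slightly.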

Atanu
Databricks Employee

dbutils.fs.ls("/tmp") should give you the size. @gtaspark

JonathanCastro
New Contributor II

I found this on the internet:

from dbruntime.dbutils import FileInfo

def get_size_of_path(path):

  return sum([file.size for file in get_all_files_in_path(path)])

def get_all_files_in_path(path, verbose=False):

  nodes_new = []

  nodes_new = dbutils.fs.ls(path)

  files = []

  while len(nodes_new) > 0:

    current_nodes = nodes_new

    nodes_new = []

    for node in current_nodes:

      if verbose:

        print(f"Processing {node.path}")

      children = dbutils.fs.ls(node.path)

      for child in children:

        if child.size == 0 and child.path != node.path:

          nodes_new.append(child)

        elif child.path != node.path:

          files.append(child)

  return files

path = "mnt/silver/delta/yourfolder/"

print(f"Size of {path} in gb: {get_size_of_path(path) / 1024 / 1024 / 1024}")

And it worked perfectly.

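It works "almost" perfectly. The code actually has a bug in the traversal: files located directly in the starting path are skipped, because listing a file returns the file itself and the "elif child.path != node.path:" check then discards it. This is fixed by changing the line "elif child.path != node.path:" to "else:". Additionally, the code can be improved by passing the verbose flag through to get_all_files_in_path.

The result would be:
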
from dbruntime.dbutils import FileInfo

def get_size_of_path(path, verbose=False):

  return sum([file.size for file in get_all_files_in_path(path, verbose)])

def get_all_files_in_path(path, verbose=False):

  nodes_new = []

  nodes_new = dbutils.fs.ls(path)

  files = []

  while len(nodes_new) > 0:

    current_nodes = nodes_new

    nodes_new = []

    for node in current_nodes:

      if verbose:

        print(f"Processing {node.path}")

      children = dbutils.fs.ls(node.path)

      for child in children:

        if verbose:

          print(f"Processing {child.path} [{child.size} bytes] in {node.path}")

        if child.size == 0 and child.path != node.path:

          nodes_new.append(child)

        #elif child.path != node.path:
        else:

          files.append(child)

  return files
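
Relying on size == 0 to detect directories also sends every empty file through an extra ls call. Here is a variant sketch that looks at the trailing slash on the path instead. It assumes that dbutils.fs.ls returns directory paths ending in "/" (worth checking on your runtime), and the names total_size and ls are only illustrative. The lister is passed in as a parameter so the function can be tried outside a notebook:

```python
from typing import Callable, Iterable

def total_size(path: str, ls: Callable[[str], Iterable]) -> int:
    """Recursively sum file sizes under path using the given lister.

    ls is expected to behave like dbutils.fs.ls: it returns entries
    that expose .path and .size attributes.
    """
    total = 0
    pending = [path]  # directories still to be listed
    while pending:
        current = pending.pop()
        for entry in ls(current):
            if entry.path.endswith("/"):
                # Assumption: directory paths end with "/".
                pending.append(entry.path)
            else:
                total += entry.size
    return total
```

In a notebook this would be called as total_size("/mnt/abc/xyz", dbutils.fs.ls), and dividing the result by 1024**3 gives the size in GiB.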

User16788316720
Databricks Employee

File size is only reported for files, so if you specify a directory as your source, you have to iterate through its contents. The snippet below should work (and should be faster than the other solutions).

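import glob
import os

def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -> int:
    # Normalize 'dbfs:/mnt/abc', '/dbfs/mnt/abc' or '/mnt/abc' to the local mount '/dbfs/mnt/abc'.
    for prefix in ('dbfs:', '/dbfs/'):
        if source_path.startswith(prefix):
            source_path = source_path[len(prefix):]
    source_path = '/dbfs/' + source_path.strip('/')

    # recursive=True is needed for '**' to descend into subdirectories.
    matches = glob.glob(os.path.join(source_path, pattern), recursive=True)
    # Skip directories (for example Spark output folders named *.parquet) so nothing is counted twice.
    files = [path for path in matches if os.path.isfile(path)]
    directory_size = sum(dbutils.fs.ls(path.replace('/dbfs/', 'dbfs:/', 1))[0].size for path in files)

    return directory_size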