Data Engineering

How to get the total directory size using dbutils

gtaspark
New Contributor II

Is there a way to get the size of a directory in ADLS (Gen2) using dbutils in Databricks?

If I run this:

dbutils.fs.ls("/mnt/abc/xyz")

I get the sizes of the individual files inside the xyz folder (there are about 5,000 files), but I want the total size of the xyz folder itself.

How can I achieve this? Any help is appreciated.

1 ACCEPTED SOLUTION

Atanu
Esteemed Contributor

dbutils.fs.ls("/tmp") should give you size. @gtaspark
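For example, a minimal non-recursive sketch in Python that sums the size field of the entries ls returns (directory entries report a size of 0, so this only counts files at the top level):

# Sum the size of each FileInfo that dbutils.fs.ls returns (top level only)
total_bytes = sum(f.size for f in dbutils.fs.ls("/mnt/abc/xyz"))
print(f"{total_bytes / (1024**3):.2f} GB")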


9 REPLIES

shyam_9
Valued Contributor

Hi @gtaspark,

Please use the size field returned by the ls command, as described in the docs below:

https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command

There is no size command, and ls returns 0 for directories.

UmakanthSingalr
New Contributor II

@gtaspark

%scala

// List the files at the top level of the path (ls is not recursive)
val path = "/mnt/abc/xyz"
val filelist = dbutils.fs.ls(path)

// Convert the FileInfo list to a DataFrame and sum the size column in GB
val df = filelist.toDF()
df.createOrReplaceTempView("adlsSize")
spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()

Breitenberg
New Contributor II

I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files, that is, the total size of all the files and everything else inside XYZ.

I can find all the folders inside a particular path, but I want the size of everything together. I also see that

display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))

gives me the size of the abc file, but I want the complete size of XYZ.

Hari_Gopinath
New Contributor II

Hi,

You can use the Unix disk usage command (du) in a notebook to get the size. As you might know, any DBFS directory is also mounted on the driver's Unix filesystem, and you can access it under /dbfs.

%sh du -h /dbfs/mnt/abc/xyz
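Note that du -h prints a line for every subdirectory; if you only want the grand total, add the -s flag (du -sh /dbfs/mnt/abc/xyz) to print a single summary line.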


jcastro
New Contributor II

I found this on the internet:

from dbruntime.dbutils import FileInfo

def get_size_of_path(path):
  return sum([file.size for file in get_all_files_in_path(path)])

def get_all_files_in_path(path, verbose=False):
  nodes_new = []
  nodes_new = dbutils.fs.ls(path)
  files = []
  while len(nodes_new) > 0:
    current_nodes = nodes_new
    nodes_new = []
    for node in current_nodes:
      if verbose:
        print(f"Processing {node.path}")
      children = dbutils.fs.ls(node.path)
      for child in children:
        if child.size == 0 and child.path != node.path:
          nodes_new.append(child)
        elif child.path != node.path:
          files.append(child)
  return files

path = "mnt/silver/delta/yourfolder/"
print(f"Size of {path} in gb: {get_size_of_path(path) / 1024 / 1024 / 1024}")

And it worked perfectly.

It works "almost" perfectly. In truth, the code has a bug in the traversal, which is fixed by changing the line "elif child.path != node.path:" to "else:". Additionally, it can be improved by passing the verbose flag through.

It would look like this...

from dbruntime.dbutils import FileInfo

def get_size_of_path(path, verbose=False):
  return sum([file.size for file in get_all_files_in_path(path, verbose)])

def get_all_files_in_path(path, verbose=False):
  nodes_new = []
  nodes_new = dbutils.fs.ls(path)
  files = []
  while len(nodes_new) > 0:
    current_nodes = nodes_new
    nodes_new = []
    for node in current_nodes:
      if verbose:
        print(f"Processing {node.path}")
      children = dbutils.fs.ls(node.path)
      for child in children:
        if verbose:
          print(f"Processing {child.path} [{child.size} bytes] in {node.path}")
        if child.size == 0 and child.path != node.path:
          nodes_new.append(child)
        #elif child.path != node.path:
        else:
          files.append(child)
  return files
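Example usage, mirroring the original post (the folder path is just illustrative):

path = "/mnt/silver/delta/yourfolder/"
print(f"Size of {path} in GB: {get_size_of_path(path) / 1024 / 1024 / 1024}")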


User16788316720
New Contributor III

File size is only specified for files. So, if you specify a directory as your source, you have to iterate through the directory. The below snippet should work (and should be faster than the other solutions).

import glob

def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -> int:
    # Normalize e.g. 'dbfs:/mnt/abc/xyz' or '/mnt/abc/xyz' to '/dbfs/mnt/abc/xyz'
    source_path = '/dbfs/' + source_path.replace('dbfs', '').lstrip(':').lstrip('/').rstrip('/')

    # recursive=True is required for the '**' wildcard to descend into subdirectories
    files = glob.glob(f'{source_path}/{pattern}', recursive=True)
    directory_size = sum([dbutils.fs.ls(path.replace('/dbfs/', ''))[0].size for path in files])

    return directory_size
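For example (the mount path is illustrative, and with the default pattern only *.parquet files are counted):

size_gb = get_directory_size_in_bytes('/mnt/abc/xyz') / 1024 / 1024 / 1024
print(f'{size_gb:.2f} GB')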
