Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to get the total directory size using dbutils

gtaspark
New Contributor II

Is there a way to get the directory size in ADLS (Gen2) using dbutils in Databricks?

If I run this

dbutils.fs.ls("/mnt/abc/xyz")

I get the sizes of the files inside the xyz folder (there are about 5000 files), but I want the size of the XYZ folder as a whole.

How can I achieve this? Any help is appreciated.

1 ACCEPTED SOLUTION


Atanu
Databricks Employee

dbutils.fs.ls("/tmp") should give you size. @gtaspark​ 


9 REPLIES

shyam_9
Databricks Employee

Hi @gtaspark,

Please use the size command to get the size, as in the docs below:

https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command

There is no size command, and ls returns 0 for directories.

UmakanthSingalr
New Contributor II

@gtaspark

%scala

val path = "/mnt/abc/xyz"

// dbutils.fs.ls lists only the immediate children of the path; sub-directories
// are reported with size 0, so this sums only the files directly under the path
val filelist = dbutils.fs.ls(path)

val df = filelist.toDF()
df.createOrReplaceTempView("adlsSize")

spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()

Breitenberg
New Contributor II

I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files. I want the total size of all the files and everything inside XYZ.

I can find all the folders inside a particular path, but I want the size of everything together. I also see that

display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))

gives me the size of the abc file, but I want the complete size of XYZ.

Hari_Gopinath
Databricks Employee

Hi,

You can use the disk usage (du) Unix command in a notebook to get the size. As you might know, any DBFS directory is also mounted on the driver's local filesystem, so you can access it under /dbfs.

%sh du -h /dbfs/mnt/abc/xyz
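If you only want the grand total for the directory rather than a per-subdirectory breakdown, the standard -s (summarize) flag of du prints a single line (same /dbfs path as above):

%sh du -sh /dbfs/mnt/abc/xyz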

Atanu
Databricks Employee

dbutils.fs.ls("/tmp") should give you size. @gtaspark​ 

JonathanCastro
New Contributor II

I have found this on the internet:

from dbruntime.dbutils import FileInfo

def get_size_of_path(path):
  return sum([file.size for file in get_all_files_in_path(path)])

def get_all_files_in_path(path, verbose=False):
  nodes_new = []
  nodes_new = dbutils.fs.ls(path)
  files = []
  while len(nodes_new) > 0:
    current_nodes = nodes_new
    nodes_new = []
    for node in current_nodes:
      if verbose:
        print(f"Processing {node.path}")
      children = dbutils.fs.ls(node.path)
      for child in children:
        if child.size == 0 and child.path != node.path:
          nodes_new.append(child)
        elif child.path != node.path:
          files.append(child)
  return files

path = "mnt/silver/delta/yourfolder/"
print(f"Size of {path} in gb: {get_size_of_path(path) / 1024 / 1024 / 1024}")

And it worked perfectly.

It works "almost" perfectly. The code actually has a bug in the traversal, which is fixed by changing the line "elif child.path != node.path:" to "else:". It can also be improved by passing the verbose flag through.

It would end up like this:
from dbruntime.dbutils import FileInfo

def get_size_of_path(path, verbose=False):
  return sum([file.size for file in get_all_files_in_path(path, verbose)])

def get_all_files_in_path(path, verbose=False):
  nodes_new = []
  nodes_new = dbutils.fs.ls(path)
  files = []
  while len(nodes_new) > 0:
    current_nodes = nodes_new
    nodes_new = []
    for node in current_nodes:
      if verbose:
        print(f"Processing {node.path}")
      children = dbutils.fs.ls(node.path)
      for child in children:
        if verbose:
          print(f"Processing {child.path} [{child.size} bytes] in {node.path}")
        if child.size == 0 and child.path != node.path:
          nodes_new.append(child)
        #elif child.path != node.path:
        else:
          files.append(child)
  return files
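Called the same way as the original snippet, with a placeholder folder path:

# Placeholder path; replace with your own mount point / folder
path = "/mnt/silver/delta/yourfolder/"
print(f"Size of {path} in GB: {get_size_of_path(path, verbose=False) / 1024 / 1024 / 1024}")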
User16788316720
New Contributor III

File size is only specified for files. So, if you specify a directory as your source, you have to iterate through the directory. The below snippet should work (and should be faster than the other solutions).

import glob
 
def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -> int:
    # Normalize to the local /dbfs FUSE path, e.g. 'dbfs:/mnt/abc/xyz' or '/mnt/abc/xyz' -> '/dbfs/mnt/abc/xyz'
    source_path = '/dbfs/' + source_path.replace('dbfs', '').lstrip('/').lstrip(':').lstrip('/').rstrip('/')
 
    # recursive=True is required for the '**' glob to descend into sub-directories
    files = glob.glob(f'{source_path}/{pattern}', recursive=True)
    directory_size = sum([dbutils.fs.ls(path.replace('/dbfs', '', 1))[0].size for path in files])
 
    return directory_size
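Usage might look like this (the mount path is just the one from the question; note that the default pattern only counts *.parquet files, so pass a different pattern for other file types):

size_bytes = get_directory_size_in_bytes('/mnt/abc/xyz')  # default pattern counts **/*.parquet files
print(f"Size of /mnt/abc/xyz: {size_bytes / (1024**3):.2f} GB")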
