How to get the total directory size using dbutils

gtaspark
New Contributor II

Is there a way to get the directory size in ADLS (Gen2) using dbutils in Databricks?

If I run this

dbutils.fs.ls("/mnt/abc/xyz")

I get the sizes of the individual files inside the xyz folder (there are about 5000 files), but I want the size of the xyz folder as a whole.

How can I achieve this? Any help is appreciated.


8 REPLIES

shyam_9
Valued Contributor

Hi @gtaspark,

Please use the size returned by the ls command, as described in the docs below:

https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command

There is no size command, and ls returns 0 for directories.
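
As a quick illustration (a hedged sketch, not from the original thread): each entry returned by dbutils.fs.ls is a FileInfo with path, name and size fields, and directory entries come back with size 0, which you can see by printing the listing:

for f in dbutils.fs.ls("/mnt/abc/xyz"):
    # Directory entries have a trailing "/" in their path and a size of 0
    print(f.name, f.size)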

UmakanthSingalr
New Contributor II

@gtaspark

%scala

// List the immediate contents of the folder (dbutils.fs.ls is not recursive)
val path = "/mnt/abc/xyz"
val filelist = dbutils.fs.ls(path)

// Register the listing as a temp view and sum the size column
val df = filelist.toDF()
df.createOrReplaceTempView("adlsSize")

spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()

Breitenberg
New Contributor II

I want to calculate the size of a directory (e.g. XYZ) that contains sub-folders and files, i.e. the total size of everything inside XYZ.

I can find all the folders inside a particular path, but I want the size of all of them together. Also, I see that

display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))

gives me the size of the abc file, but I want the complete size of XYZ.

Hari_Gopinath
New Contributor II

Hi,

You can use the disk usage (du) Unix command in a notebook to get the size. As you might know, any DBFS directory is also mounted on the driver's local filesystem, so you can access it under /dbfs.

%sh du -h /dbfs/mnt/abc/xyz
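
Note that du -h prints a line for every sub-directory under xyz. If you only want a single grand total for the folder, adding the -s (summarize) flag should do it:

%sh du -sh /dbfs/mnt/abc/xyz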

Atanu
Esteemed Contributor

dbutils.fs.ls("/tmp") should give you size. @gtaspark​ 

jcastro
New Contributor II

I found this on the internet:

from dbruntime.dbutils import FileInfo

def get_size_of_path(path):
    # Sum the sizes of every file found under the path (recursively)
    return sum(file.size for file in get_all_files_in_path(path))

def get_all_files_in_path(path, verbose=False):
    # Breadth-first walk: directories are listed again, files are collected
    nodes_new = dbutils.fs.ls(path)
    files = []
    while len(nodes_new) > 0:
        current_nodes = nodes_new
        nodes_new = []
        for node in current_nodes:
            if verbose:
                print(f"Processing {node.path}")
            if node.path.endswith("/") and node.size == 0:
                # Directory: queue its children for the next pass
                nodes_new.extend(dbutils.fs.ls(node.path))
            else:
                # File: record it
                files.append(node)
    return files

path = "/mnt/silver/delta/yourfolder/"
print(f"Size of {path} in GB: {get_size_of_path(path) / 1024 / 1024 / 1024}")

It worked perfectly.

User16788316720
New Contributor III

File sizes are only reported for files, so if your source is a directory you have to iterate through its contents. The snippet below should work (and should be faster than the other solutions).

import glob
import os

def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -> int:
    # Normalize to a /dbfs/... FUSE path, e.g. "dbfs:/mnt/abc" -> "/dbfs/mnt/abc"
    source_path = '/dbfs/' + source_path.replace('dbfs', '').lstrip(':').lstrip('/').rstrip('/')

    # recursive=True is needed for the '**' pattern to descend into sub-directories
    files = glob.glob(f'{source_path}/{pattern}', recursive=True)
    directory_size = sum(os.path.getsize(file) for file in files)

    return directory_size
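
A hypothetical usage example (the mount path is only an illustration); with the default pattern it reports just the parquet data under the folder:

size_bytes = get_directory_size_in_bytes('/mnt/abc/xyz')
print(f"Parquet data under /mnt/abc/xyz: {size_bytes / 1024 / 1024 / 1024:.2f} GB")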
