02-05-2020 12:57 PM
Is there a way to get the directory size in ADLS(gen2) using dbutils in databricks?
If I run this
dbutils.fs.ls("/mnt/abc/xyz")
I get the sizes of the files inside the xyz folder (there are about 5000 files), but I want the size of the XYZ folder itself.
How can I achieve this? Any help is appreciated.
02-16-2020 11:43 PM
Hi @gtaspark,
Please use the size command to get the size, as in the docs below: https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command
04-30-2020 09:01 PM
There is no size command, and ls returns 0 for directories.
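To illustrate why ls alone is not enough: it returns FileInfo entries whose size is 0 for directories, so summing the sizes covers only the files at one level. A minimal sketch with mocked-up entries (the FileInfo shape and sample paths here are assumptions for illustration; in a notebook, dbutils.fs.ls supplies the real objects):

```python
from collections import namedtuple

# Hypothetical stand-in for the entries dbutils.fs.ls returns
# (real entries carry at least path, name, and size; directories list with size 0).
FileInfo = namedtuple("FileInfo", ["path", "name", "size"])

listing = [
    FileInfo("dbfs:/mnt/abc/xyz/part-0001.parquet", "part-0001.parquet", 2048),
    FileInfo("dbfs:/mnt/abc/xyz/part-0002.parquet", "part-0002.parquet", 4096),
    FileInfo("dbfs:/mnt/abc/xyz/sub/", "sub/", 0),  # subfolder: size reported as 0
]

# Summing the size field counts only the files at this level;
# the subfolder contributes nothing, which is why a recursive walk is needed.
top_level_bytes = sum(f.size for f in listing if not f.name.endswith("/"))
print(top_level_bytes)  # 6144
```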
02-18-2021 07:12 AM
@gtaspark
%scala
val path = "/mnt/abc/xyz"
val fileList = dbutils.fs.ls(path)
val df = fileList.toDF()
df.createOrReplaceTempView("adlsSize")
spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()
02-19-2021 03:37 AM
I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files; that is, the total size of all the files and everything inside XYZ.
I could find all the folders inside a particular path, but I want the size of everything together. I also see that
display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))
gives me the size of the abc file, but I want the complete size of XYZ.
11-23-2021 03:32 PM
Hi,
You can use the disk usage Unix command in a notebook to get the size. As you may know, any DBFS directory is also mounted on the Unix filesystem, and you can access it under /dbfs.
%sh du -h /dbfs/mnt/abc/xyz
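du -h prints one line per subdirectory; adding -s gives a single summary line, and -b (GNU du) reports exact bytes. A sketch against a local throwaway directory, since the /dbfs mount only exists on a cluster:

```shell
# Create a throwaway directory with one 4 KiB file to demonstrate the flags;
# on a cluster you would point du at /dbfs/mnt/abc/xyz instead.
dir=$(mktemp -d)
head -c 4096 /dev/zero > "$dir/sample.bin"

du -sh "$dir"   # one human-readable summary line for the whole tree
du -sb "$dir"   # same, in exact (apparent) bytes; GNU du only

rm -rf "$dir"
```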
11-27-2021 07:26 AM
dbutils.fs.ls("/tmp") should give you the sizes. @gtaspark
01-13-2023 04:19 AM
I found this on the internet:
from dbruntime.dbutils import FileInfo

def get_size_of_path(path):
    return sum([file.size for file in get_all_files_in_path(path)])

def get_all_files_in_path(path, verbose=False):
    nodes_new = dbutils.fs.ls(path)
    files = []
    while len(nodes_new) > 0:
        current_nodes = nodes_new
        nodes_new = []
        for node in current_nodes:
            if verbose:
                print(f"Processing {node.path}")
            children = dbutils.fs.ls(node.path)
            for child in children:
                if child.size == 0 and child.path != node.path:
                    nodes_new.append(child)
                elif child.path != node.path:
                    files.append(child)
    return files

path = "mnt/silver/delta/yourfolder/"
print(f"Size of {path} in gb: {get_size_of_path(path) / 1024 / 1024 / 1024}")
And it worked perfectly.
06-10-2024 01:47 PM - edited 06-10-2024 01:48 PM
It works "almost" perfectly. The code actually has a bug in the traversal, which is fixed by changing the line "elif child.path != node.path:" to "else:". Additionally, it can be improved by passing the verbose flag through.
It would become...
from dbruntime.dbutils import FileInfo

def get_size_of_path(path, verbose=False):
    return sum([file.size for file in get_all_files_in_path(path, verbose)])

def get_all_files_in_path(path, verbose=False):
    nodes_new = dbutils.fs.ls(path)
    files = []
    while len(nodes_new) > 0:
        current_nodes = nodes_new
        nodes_new = []
        for node in current_nodes:
            if verbose:
                print(f"Processing {node.path}")
            children = dbutils.fs.ls(node.path)
            for child in children:
                if verbose:
                    print(f"Processing {child.path} [{child.size} bytes] in {node.path}")
                if child.size == 0 and child.path != node.path:
                    nodes_new.append(child)
                #elif child.path != node.path:
                else:
                    files.append(child)
    return files
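As a cross-check outside of dbutils: because mounts are also visible under /dbfs on the driver, the same recursive total can be computed with the standard library alone. A sketch (the example path in the comment is a placeholder):

```python
import os

def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    total = 0
    for dirpath, _subdirs, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# e.g. print(dir_size_bytes("/dbfs/mnt/abc/xyz") / 1024 ** 3)  # size in GB
```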
06-21-2023 10:22 AM
A size is only reported for files, so if you specify a directory as your source, you have to iterate through it. The snippet below should work (and should be faster than the other solutions).
import glob

def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -> int:
    # Normalize to the local mount path, e.g. 'dbfs:/mnt/x' or '/mnt/x' -> '/dbfs/mnt/x'
    source_path = '/dbfs/' + source_path.replace('dbfs', '').lstrip(':').lstrip('/').rstrip('/')
    # recursive=True is required for the '**' pattern to match nested directories
    files = glob.glob(f'{source_path}/{pattern}', recursive=True)
    directory_size = sum(dbutils.fs.ls(path.replace('/dbfs', ''))[0].size for path in files)
    return directory_size
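Since every path matched under /dbfs is also a local file, os.path.getsize can stand in for a per-file dbutils.fs.ls round trip; a standard-library sketch of the same glob idea (the mount path in the comment is a placeholder):

```python
import glob
import os

def glob_size_bytes(root: str, pattern: str = "**/*.parquet") -> int:
    # recursive=True makes '**' match files at any depth under root
    matches = glob.glob(os.path.join(root, pattern), recursive=True)
    return sum(os.path.getsize(p) for p in matches)

# e.g. glob_size_bytes("/dbfs/mnt/abc/xyz")  # hypothetical mount path
```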