02-05-2020 12:57 PM
Is there a way to get the directory size in ADLS (Gen2) using dbutils in Databricks?
If I run
dbutils.fs.ls("/mnt/abc/xyz")
I get the sizes of the individual files inside the xyz folder (there are about 5000 files), but I want the size of the xyz folder itself.
How can I achieve this? Any help is appreciated.
02-16-2020 11:43 PM
Hi @gtaspark,
Please use the ls command to get the size, as described in the docs: https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command
04-30-2020 09:01 PM
There is no size command, and ls returns 0 for directories.
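For example, a quick check against the mount from the question (assuming it exists) shows directory entries listed with size 0:
for f in dbutils.fs.ls("/mnt/abc/xyz"):
    print(f.path, f.size)  # directories are reported with size 0, files with their byte size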
02-18-2021 07:12 AM
@gtaspark
%scala
val path = "/mnt/abc/xyz"
// List the directory; note that dbutils.fs.ls is not recursive, so this
// sums only the files directly under the path, not nested subfolders.
val filelist = dbutils.fs.ls(path)
val df = filelist.toDF()
df.createOrReplaceTempView("adlsSize")
spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()
02-19-2021 03:37 AM
I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files: the total size of all the files and everything else inside XYZ.
I can find all the folders inside a particular path, but I want the size of everything together. I also see that
display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))
gives me the size of the abc file, but I want the complete size of XYZ.
11-23-2021 03:32 PM
Hi,
You can use the Unix disk usage command in a notebook to get the size. As you might know, DBFS is also mounted on the driver's local filesystem, so you can access any DBFS path under /dbfs.
%sh du -h /dbfs/mnt/abc/xyz
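If you only want a single total for the directory rather than one line per subfolder, the standard -s flag summarizes:
%sh du -sh /dbfs/mnt/abc/xyz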
11-27-2021 07:26 AM
dbutils.fs.ls("/tmp") should give you size. @gtaspark
01-13-2023 04:19 AM
I found this on the internet:
def get_size_of_path(path):
    # Sum the sizes of every file found anywhere under the path.
    return sum(file.size for file in get_all_files_in_path(path))

def get_all_files_in_path(path, verbose=False):
    # Breadth-first traversal: dbutils.fs.ls is not recursive, so keep a
    # queue of directories and list each one until none remain.
    files = []
    dirs_new = [path]
    while len(dirs_new) > 0:
        current_dirs = dirs_new
        dirs_new = []
        for dir_path in current_dirs:
            if verbose:
                print(f"Processing {dir_path}")
            for child in dbutils.fs.ls(dir_path):
                # dbutils.fs.ls reports directories with size 0 and a
                # trailing slash; queue those, collect everything else.
                if child.size == 0 and child.path.endswith("/"):
                    dirs_new.append(child.path)
                else:
                    files.append(child)
    return files

path = "/mnt/silver/delta/yourfolder/"
print(f"Size of {path} in GB: {get_size_of_path(path) / 1024 / 1024 / 1024}")
And it worked perfectly.
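Note that this makes one dbutils.fs.ls call per directory, so on a tree with thousands of subfolders it can take a while.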
06-21-2023 10:22 AM
File size is only reported for files, so if you specify a directory as your source you have to iterate through it. The snippet below should work (and should be faster than the other solutions).
import glob

def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -> int:
    # Normalize the input ('dbfs:/mnt/...' or '/mnt/...') to a local /dbfs/... path.
    source_path = '/dbfs/' + source_path.replace('dbfs:', '').lstrip('/').rstrip('/')
    # recursive=True is required for the '**' wildcard to descend into subfolders.
    files = glob.glob(f'{source_path}/{pattern}', recursive=True)
    # Map each local path back to a DBFS path so dbutils.fs.ls can report its size.
    directory_size = sum(dbutils.fs.ls(path.replace('/dbfs/', '/'))[0].size for path in files)
    return directory_size
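For example, using the mount from the original question:
size_in_bytes = get_directory_size_in_bytes('/mnt/abc/xyz')
print(f'{size_in_bytes / 1024 ** 3:.2f} GB')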