<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to get the total directory size using dbutils in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27286#M19163</link>
    <description>&lt;P&gt;Is there a way to get the size of a directory in ADLS Gen2 using dbutils in Databricks?&lt;/P&gt;
&lt;P&gt;If I run this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.ls("/mnt/abc/xyz")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I get the sizes of the individual files inside the xyz folder (there are about 5000 files), but I want the total size of the xyz folder itself.&lt;/P&gt;
&lt;P&gt;How can I achieve this? Any help is appreciated.&lt;/P&gt;</description>
    <pubDate>Wed, 05 Feb 2020 20:57:42 GMT</pubDate>
    <dc:creator>gtaspark</dc:creator>
    <dc:date>2020-02-05T20:57:42Z</dc:date>
    <item>
      <title>How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27286#M19163</link>
      <description>&lt;P&gt;Is there a way to get the size of a directory in ADLS Gen2 using dbutils in Databricks?&lt;/P&gt;
&lt;P&gt;If I run this:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.ls("/mnt/abc/xyz")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I get the sizes of the individual files inside the xyz folder (there are about 5000 files), but I want the total size of the xyz folder itself.&lt;/P&gt;
&lt;P&gt;How can I achieve this? Any help is appreciated.&lt;/P&gt;</description>
      <pubDate>Wed, 05 Feb 2020 20:57:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27286#M19163</guid>
      <dc:creator>gtaspark</dc:creator>
      <dc:date>2020-02-05T20:57:42Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27287#M19164</link>
      <description>&lt;P&gt;Hi @gtaspark,&lt;/P&gt;&lt;P&gt;Please use the size command to get the size, as described in the docs below:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command" target="_blank"&gt;https://docs.databricks.com/dev-tools/databricks-utils.html#dbutilsfsls-command&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 17 Feb 2020 07:43:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27287#M19164</guid>
      <dc:creator>shyam_9</dc:creator>
      <dc:date>2020-02-17T07:43:04Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27288#M19165</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;There is no &lt;CODE&gt;size&lt;/CODE&gt; command, and &lt;CODE&gt;ls&lt;/CODE&gt; returns a size of 0 for directories.&lt;/P&gt;
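&lt;P&gt;Because directories report a size of 0, a total has to be built by descending into each directory and adding up the files. A minimal sketch of that idea follows. The names directory_size and ls are hypothetical, and the sketch assumes each listing entry exposes path, size and isDir(), as the FileInfo objects returned by dbutils.fs.ls do on recent runtimes. The lister is passed in as an argument so the function can also be exercised outside a notebook.&lt;/P&gt;

```python
def directory_size(path, ls):
    """Return the total size in bytes of all files below path.

    ls is any callable that behaves like dbutils.fs.ls (assumption): it takes
    a path and returns entries exposing .path, .size and .isDir().
    """
    total = 0
    pending = [path]  # directories that still have to be listed
    while pending:
        current = pending.pop()
        for entry in ls(current):
            if entry.isDir():
                # Directories carry no size of their own, so list them later.
                pending.append(entry.path)
            else:
                total += entry.size
    return total
```

&lt;P&gt;In a notebook this would be called as directory_size("/mnt/abc/xyz", dbutils.fs.ls). Checking isDir() rather than testing for a size of 0 avoids re-listing empty files.&lt;/P&gt;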
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 May 2020 04:01:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27288#M19165</guid>
      <dc:creator>DimitriBlyumin</dc:creator>
      <dc:date>2020-05-01T04:01:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27289#M19166</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;@gtaspark&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;%scala
val path = "/mnt/abc/xyz"
// List the entries directly under the path (this does not recurse into subdirectories)
val filelist = dbutils.fs.ls(path)
val df = filelist.toDF()
df.createOrReplaceTempView("adlsSize")
// Sum the per-file sizes (bytes) and convert to GB
spark.sql("select sum(size)/(1024*1024*1024) as sizeInGB from adlsSize").show()&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 18 Feb 2021 15:12:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27289#M19166</guid>
      <dc:creator>UmakanthSingalr</dc:creator>
      <dc:date>2021-02-18T15:12:06Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27290#M19167</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files. I want the total size of all the files and everything else inside XYZ.&lt;/P&gt;
&lt;P&gt;I can list all the folders inside a particular path, but I want the size of everything together. I also see that&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;gives me the size of the abc file, but I want the complete size of XYZ.&lt;/P&gt;
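&lt;P&gt;One possible way to get the combined figure, sketched here on the assumption that the mount is also reachable on the local filesystem under /dbfs, is to walk the tree with the Python standard library and add up the file sizes. The function name local_directory_size is hypothetical.&lt;/P&gt;

```python
import os


def local_directory_size(root):
    """Sum the sizes, in bytes, of all regular files below root."""
    total = 0
    # os.walk visits root and every nested subdirectory.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if os.path.isfile(file_path):  # skip broken symlinks
                total += os.path.getsize(file_path)
    return total
```

&lt;P&gt;Under that assumption, a call such as local_directory_size("/dbfs/mnt/abc/xyz") would return the total in bytes; dividing by 1024 three times converts it to GB.&lt;/P&gt;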
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Feb 2021 11:37:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27290#M19167</guid>
      <dc:creator>Breitenberg</dc:creator>
      <dc:date>2021-02-19T11:37:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27291#M19168</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;You can use the Unix disk usage command (&lt;CODE&gt;du&lt;/CODE&gt;) in a notebook to get the size. DBFS directories are also mounted on the local Unix filesystem, where you can access them under /dbfs. Note that du -h prints a line for every subdirectory, with the total for xyz last; use du -sh to print only the total.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh du -h /dbfs/mnt/abc/xyz&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 23 Nov 2021 23:32:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27291#M19168</guid>
      <dc:creator>Hari_Gopinath</dc:creator>
      <dc:date>2021-11-23T23:32:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27292#M19169</link>
      <description>&lt;P&gt;@gtaspark dbutils.fs.ls("/tmp") should give you the size.&lt;/P&gt;</description>
      <pubDate>Sat, 27 Nov 2021 15:26:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27292#M19169</guid>
      <dc:creator>Atanu</dc:creator>
      <dc:date>2021-11-27T15:26:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27293#M19170</link>
      <description>&lt;P&gt;I found this on the internet:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from dbruntime.dbutils import FileInfo

def get_size_of_path(path):
  return sum([file.size for file in get_all_files_in_path(path)])

def get_all_files_in_path(path, verbose=False):
  nodes_new = dbutils.fs.ls(path)
  files = []

  while len(nodes_new) &amp;gt; 0:
    current_nodes = nodes_new
    nodes_new = []
    for node in current_nodes:
      if verbose:
        print(f"Processing {node.path}")
      children = dbutils.fs.ls(node.path)
      for child in children:
        if child.size == 0 and child.path != node.path:
          nodes_new.append(child)
        elif child.path != node.path:
          files.append(child)
  return files

path = "mnt/silver/delta/yourfolder/"

print(f"Size of {path} in gb: {get_size_of_path(path) / 1024 / 1024 / 1024}")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;And it worked perfectly.&lt;/P&gt;</description>
      <pubDate>Fri, 13 Jan 2023 12:19:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27293#M19170</guid>
      <dc:creator>JonathanCastro</dc:creator>
      <dc:date>2023-01-13T12:19:37Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27294#M19171</link>
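      <description>&lt;P&gt;A size is only reported for files, so if your source is a directory you have to iterate over its contents. The snippet below should work (and may be faster than the other solutions).&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import glob
import os

def get_directory_size_in_bytes(source_path: str, pattern: str = '**/*.parquet') -&amp;gt; int:
    # Normalize 'dbfs:/mnt/x', '/dbfs/mnt/x' or '/mnt/x' to the local mount path '/dbfs/mnt/x'
    for prefix in ('dbfs:', '/dbfs/'):
        if source_path.startswith(prefix):
            source_path = source_path[len(prefix):]
    source_path = '/dbfs/' + source_path.strip('/')

    # recursive=True is needed for '**' to match nested subdirectories;
    # keep only files, because Spark often writes directories whose names end in .parquet
    files = [p for p in glob.glob(f'{source_path}/{pattern}', recursive=True) if os.path.isfile(p)]
    # Look up each file's size through dbutils, using its dbfs:/ path
    directory_size = sum([dbutils.fs.ls('dbfs:/' + path[len('/dbfs/'):])[0].size for path in files])

    return directory_size&lt;/CODE&gt;&lt;/PRE&gt;</description>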
      <pubDate>Wed, 21 Jun 2023 17:22:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/27294#M19171</guid>
      <dc:creator>User16788316720</dc:creator>
      <dc:date>2023-06-21T17:22:51Z</dc:date>
    </item>
    <item>
      <title>Re: How to get the total directory size using dbutils</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/72268#M34537</link>
      <description>&lt;P&gt;The code posted by @JonathanCastro works "almost" perfectly. It actually has a bug in the traversal, which is fixed by changing the line "elif child.path != node.path:" to "else:". With the elif, files directly under the starting path are never counted: listing a file returns the file itself, so child.path equals node.path and neither branch runs. Additionally, the code can be improved by passing the verbose flag through.&lt;/P&gt;&lt;P&gt;The result would be:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from dbruntime.dbutils import FileInfo

def get_size_of_path(path, verbose=False):
  return sum([file.size for file in get_all_files_in_path(path, verbose)])

def get_all_files_in_path(path, verbose=False):
  nodes_new = dbutils.fs.ls(path)
  files = []

  while len(nodes_new) &amp;gt; 0:
    current_nodes = nodes_new
    nodes_new = []
    for node in current_nodes:
      if verbose:
        print(f"Processing {node.path}")
      children = dbutils.fs.ls(node.path)
      for child in children:
        if verbose:
          print(f"Processing {child.path} [{child.size} bytes] in {node.path}")
        if child.size == 0 and child.path != node.path:
          nodes_new.append(child)
        # was: elif child.path != node.path:
        else:
          files.append(child)
  return files&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 10 Jun 2024 20:48:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-the-total-directory-size-using-dbutils/m-p/72268#M34537</guid>
      <dc:creator>AgileThought</dc:creator>
      <dc:date>2024-06-10T20:48:01Z</dc:date>
    </item>
  </channel>
</rss>

