
Retrieve file size from Azure in Databricks

anonymous_567
New Contributor II

Hello, I am running a job that reads in files of different sizes, each one representing a different dataset, and loads them into a Delta table. Some files are as big as 100 GiB and others as small as 500 MiB. I want to repartition each file based on its size. I am running the following code to get the size of a file given its path in Azure.

%pip install azure-storage-blob

from azure.storage.blob import BlobServiceClient

def get_file_size(file_path):
    # Extract the container and blob name from the file path
    container_name = <container_name> #from file_path
    blob_name = <blob_name> #from file_path
    print(container_name)
    print(blob_name)
    # Create a BlobServiceClient
    connection_string = <connection_string>
   
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
   
    # Get the BlobClient
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
   
    # Get the blob properties
    blob_properties = blob_client.get_blob_properties()
   
    # Get the size of the blob in bytes
    return blob_properties.size


This code doesn't execute to completion; it errors out at "blob_properties = blob_client.get_blob_properties()" with a ServiceRequestError that says, "Failed to establish a new connection: [Errno -2] Name or service not known."

Does anybody know where to go from here to retrieve the file size without having to go into the UI manually?
1 REPLY

LindasonUk
New Contributor III

You could try using the dbutils file system utilities (dbutils.fs) like this:

from pyspark.sql.functions import col, desc, expr, regexp_replace

directory = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root'
all_files = []

# Recursively collect every file (skipping directories) under the root path
def list_files(path):
    for file in dbutils.fs.ls(path):
        if file.isDir():
            list_files(file.path)
        else:
            all_files.append(file)

list_files(directory)

# dbutils.fs.ls returns FileInfo objects with path, name, size (bytes) and modificationTime
files_df = spark.createDataFrame(all_files)

# Strip the storage prefix, then the file name, so "path" holds the parent folder
files_df = files_df.withColumn(
    "path",
    regexp_replace(col("path"), "<storage-path-prefix>", "")
)
files_df = files_df.withColumn(
    "path",
    expr("substring(path, 1, length(path) - length(name))")
)

files_sizes_df = files_df.select(
    'path',
    (col('size') / (1024 * 1024 * 1024)).alias('size_gb'),
    'name'
).orderBy(desc("size_gb"))

display(files_sizes_df)
 
You should be able to wrap similar behaviour in a function and drive it with a root directory parameter.
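
To tie this back to the original goal of repartitioning each dataset by its size, here is a minimal sketch of how the size returned by dbutils.fs.ls could drive the partition count before writing to Delta. The 128 MiB-per-partition target, the parquet source format, and the write_repartitioned name are illustrative assumptions, not something prescribed in this thread:

import math

# Assumed target size per partition (illustrative only)
TARGET_PARTITION_BYTES = 128 * 1024 * 1024

def write_repartitioned(file_path, table_path):
    # dbutils.fs.ls on a single file returns a one-element list of FileInfo
    size_bytes = dbutils.fs.ls(file_path)[0].size
    num_partitions = max(1, math.ceil(size_bytes / TARGET_PARTITION_BYTES))

    df = spark.read.format("parquet").load(file_path)  # adjust to your source format
    (df.repartition(num_partitions)
       .write.format("delta")
       .mode("append")
       .save(table_path))

You could call this once per file path, so the 100 GiB datasets end up with many partitions while the 500 MiB ones stay at one or two.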