
Retrieve file size from Azure in Databricks

anonymous_567
New Contributor II

Hello, I am running a job that reads in files of different sizes, each one representing a different dataset, and loads them into a Delta table. Some files are as large as 100 GiB and others as small as 500 MiB. I want to repartition each file based on its size, so I am trying to run the following code to get the size of a file given its path in Azure.

%pip install azure-storage-blob

from azure.storage.blob import BlobServiceClient

def get_file_size(file_path):
    # Extract the container and blob name from the file path
    container_name = <container_name>  # from file_path
    blob_name = <blob_name>  # from file_path
    print(container_name)
    print(blob_name)

    # Create a BlobServiceClient
    connection_string = <connection_string>
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get the BlobClient
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Get the blob properties
    blob_properties = blob_client.get_blob_properties()

    # Return the size of the blob in bytes
    return blob_properties.size


This code doesn't execute to completion; it errors out at "blob_properties = blob_client.get_blob_properties()" with a ServiceRequestError that says, "Failed to establish a new connection: [Errno -2] Name or service not known."

Does anybody know where to go from here, or how to retrieve the file size without having to manually go into the UI?
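For reference, a minimal sketch of one way to build the client from the account URL instead of a connection string, assuming the file path is an abfss:// URI (the parse_abfss_path helper and the account-key credential below are illustrative assumptions, not part of the code above). The "Name or service not known" error is a DNS failure, which often points at a malformed endpoint in the connection string or a cluster that cannot reach the storage endpoint:

from urllib.parse import urlparse
from azure.storage.blob import BlobServiceClient

def parse_abfss_path(file_path):
    # abfss://<container>@<account>.dfs.core.windows.net/<blob-path>
    parsed = urlparse(file_path)
    container_name = parsed.username              # part before the '@'
    account_name = parsed.hostname.split(".")[0]  # storage account name
    blob_name = parsed.path.lstrip("/")           # path within the container
    return account_name, container_name, blob_name

def get_blob_size(file_path, account_key):
    account_name, container_name, blob_name = parse_abfss_path(file_path)
    # Talk to the Blob endpoint directly rather than relying on a connection string
    account_url = f"https://{account_name}.blob.core.windows.net"
    service_client = BlobServiceClient(account_url=account_url, credential=account_key)
    blob_client = service_client.get_blob_client(container=container_name, blob=blob_name)
    return blob_client.get_blob_properties().size  # size in bytes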
1 REPLY

LindasonUk
New Contributor II

You could try utilising the Databricks file system utilities (dbutils.fs) like this:

from pyspark.sql.functions import col, desc, regexp_replace

directory = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root'
all_files = []

# Recursively walk the directory tree, collecting a FileInfo entry for every file
def list_files(path):
    for file in dbutils.fs.ls(path):
        if file.isDir():
            list_files(file.path)
        else:
            all_files.append(file)

list_files(directory)

# FileInfo exposes path, name, size (in bytes) and modificationTime
files_df = spark.createDataFrame(all_files)

# Strip the storage prefix and the file name so "path" holds the parent folder
files_df = files_df.withColumn(
    "path",
    regexp_replace(
        regexp_replace(col("path"), "<storage-path-prefix>", ""),
        col("name"),
        ""
    )
)

files_sizes_df = files_df.select(
    "path",
    (col("size") / (1024 * 1024 * 1024)).alias("size_gb"),
    "name"
).orderBy(desc("size_gb"))

display(files_sizes_df)
 
You should be able to wrap similar behaviour in a function and drive it with a root-directory parameter.
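To tie this back to the original goal of repartitioning each file by its size, a rough sketch of driving the listing from a root-directory parameter and deriving a partition count per file. The 128 MiB target partition size, the Parquet reader, and the <target_table> name are assumptions for illustration only, not a fixed recommendation:

import math

def list_files_recursive(path):
    # Recursively collect FileInfo entries under the given root
    entries = []
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            entries.extend(list_files_recursive(entry.path))
        else:
            entries.append(entry)
    return entries

def partition_count(size_bytes, target_bytes=128 * 1024 * 1024):
    # At least one partition, otherwise roughly one partition per 128 MiB of input
    return max(1, math.ceil(size_bytes / target_bytes))

root = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root'
for f in list_files_recursive(root):
    df = spark.read.parquet(f.path)  # assumes Parquet input; swap in the reader for your format
    df.repartition(partition_count(f.size)) \
      .write.format("delta").mode("append").saveAsTable("<target_table>")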
