Retrieve file size from Azure in Databricks
10-03-2024 08:12 AM
Hello, I am running a job that reads in files of different sizes, each one representing a different dataset, and loads them into a Delta table. Some files are as large as 100 GiB and others as small as 500 MiB. I want to repartition each file based on its size, so I am trying to run the following code to get the size of a file given its path in Azure.
%pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

def get_file_size(file_path):
    # Extract the container and blob name from the file path
    container_name = <container_name>  # from file_path
    blob_name = <blob_name>  # from file_path
    print(container_name)
    print(blob_name)
    # Create a BlobServiceClient
    connection_string = <connection_string>
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    # Get the BlobClient
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    # Get the blob properties
    blob_properties = blob_client.get_blob_properties()
    # Return the size of the blob in bytes
    return blob_properties.size
This code doesn't execute to completion; it errors out at "blob_properties = blob_client.get_blob_properties()" with a ServiceRequestError: "Failed to establish a new connection: [Errno -2] Name or service not known."
Does anybody know where to go from here if I want to retrieve file sizes without having to go into the UI manually?
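For context, the plan once I have the size is to derive a partition count from it before writing to the Delta table. A rough sketch of the intent (the 128 MiB-per-partition target and the Parquet source format are just illustrative assumptions):

import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # assumed target of ~128 MiB per partition

def load_with_repartition(file_path, table_name):
    # get_file_size is the helper above; the source format here is just an example
    size_bytes = get_file_size(file_path)
    num_partitions = max(1, math.ceil(size_bytes / TARGET_PARTITION_BYTES))
    df = spark.read.format("parquet").load(file_path)
    (df.repartition(num_partitions)
       .write.format("delta")
       .mode("overwrite")
       .saveAsTable(table_name))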
1 REPLY
10-14-2024 06:20 AM
You could try utilising the dbutils file system utilities (dbutils.fs) like this:
from pyspark.sql.functions import col, desc, regexp_replace

directory = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root'

all_files = []

def list_files(path):
    # Recursively walk the directory tree and collect the FileInfo of every file
    files = dbutils.fs.ls(path)
    for file in files:
        if file.isDir():
            list_files(file.path)
        else:
            all_files.append(file)

list_files(directory)

# FileInfo exposes path, name and size (in bytes)
files_df = spark.createDataFrame(all_files)

# input_file_name() only works on DataFrames read from files, so derive the
# parent folder from the listed path instead
files_df = files_df.withColumn("parent_folder", regexp_replace(col("path"), col("name"), ""))

# Reduce the full path to a relative folder by stripping the storage prefix and the file name
files_df = files_df.withColumn(
    "path",
    regexp_replace(
        regexp_replace(
            col("path"),
            "<storage-path-prefix>",
            ""
        ),
        col("name"),
        ""
    )
)

files_sizes_df = files_df.select(
    'path',
    (col('size') / (1024 * 1024 * 1024)).alias('size_gb'),
    'name'
)

files_sizes_df = files_sizes_df.orderBy(desc("size_gb"))
display(files_sizes_df)
You should be able to work similar behaviour into a function and drive this with a root directory parameter.
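For example, a minimal sketch of that approach (the function name and the GiB conversion are just suggestions):

def get_file_sizes_df(root_dir):
    # Recursively list every file under root_dir and return a DataFrame of
    # path, name and size in GiB, largest first
    collected = []

    def _walk(path):
        for entry in dbutils.fs.ls(path):
            if entry.isDir():
                _walk(entry.path)
            else:
                collected.append(entry)

    _walk(root_dir)
    df = spark.createDataFrame(collected)
    return (df
            .select('path', 'name', (col('size') / (1024 ** 3)).alias('size_gb'))
            .orderBy(desc('size_gb')))

# Example usage:
# sizes_df = get_file_sizes_df('abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root')
# display(sizes_df)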

