Retrieve file size from Azure in Databricks
10-03-2024 08:12 AM
Hello, I am running a job that reads in files of different sizes, each one representing a different dataset, and loads them into a Delta table. Some files are as large as 100 GiB and others as small as 500 MiB. I want to repartition each file based on its size, so I am trying to run the following code to get the size of a file given its path in Azure.
%pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

def get_file_size(file_path):
    # Extract the container and blob name from the file path
    container_name = <container_name>  # from file_path
    blob_name = <blob_name>  # from file_path
    print(container_name)
    print(blob_name)
    # Create a BlobServiceClient
    connection_string = <connection_string>
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    # Get the BlobClient
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    # Get the blob properties
    blob_properties = blob_client.get_blob_properties()
    # Return the size of the blob in bytes
    return blob_properties.size
This code doesn't execute to completion; it errors out at "blob_properties = blob_client.get_blob_properties()" with a ServiceRequestError: "Failed to establish a new connection: [Errno -2] Name or service not known."
Does anybody know where to go from here if I want to retrieve file sizes without having to go into the UI manually?
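For context, the plan once I have the size is to derive a partition count from it before writing to the Delta table. A rough sketch of the intent (the 128 MiB-per-partition target and the Parquet source format are just illustrative assumptions):

import math

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # assumed target of ~128 MiB per partition

def load_with_repartition(file_path, table_name):
    # get_file_size is the helper above; the source format here is just an example
    size_bytes = get_file_size(file_path)
    num_partitions = max(1, math.ceil(size_bytes / TARGET_PARTITION_BYTES))
    df = spark.read.format("parquet").load(file_path)
    (df.repartition(num_partitions)
       .write.format("delta")
       .mode("overwrite")
       .saveAsTable(table_name))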
1 REPLY
10-14-2024 06:20 AM
You could try utilising the dbutils file system utilities (dbutils.fs) like this:
from pyspark.sql.functions import col, desc, regexp_replace

directory = 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root'

all_files = []

def list_files(path):
    # Recursively walk the directory tree and collect the FileInfo of every file
    files = dbutils.fs.ls(path)
    for file in files:
        if file.isDir():
            list_files(file.path)
        else:
            all_files.append(file)

list_files(directory)

# FileInfo exposes path, name and size (in bytes)
files_df = spark.createDataFrame(all_files)

# input_file_name() only works on DataFrames read from files, so derive the
# parent folder from the listed path instead
files_df = files_df.withColumn("parent_folder", regexp_replace(col("path"), col("name"), ""))

# Reduce the full path to a relative folder by stripping the storage prefix and the file name
files_df = files_df.withColumn(
    "path",
    regexp_replace(
        regexp_replace(
            col("path"),
            "<storage-path-prefix>",
            ""
        ),
        col("name"),
        ""
    )
)

files_sizes_df = files_df.select(
    'path',
    (col('size') / (1024 * 1024 * 1024)).alias('size_gb'),
    'name'
)

files_sizes_df = files_sizes_df.orderBy(desc("size_gb"))
display(files_sizes_df)
You should be able to work similar behaviour into a function and drive this with a root directory parameter.
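For example, a minimal sketch of that approach (the function name and the GiB conversion are just suggestions):

def get_file_sizes_df(root_dir):
    # Recursively list every file under root_dir and return a DataFrame of
    # path, name and size in GiB, largest first
    collected = []

    def _walk(path):
        for entry in dbutils.fs.ls(path):
            if entry.isDir():
                _walk(entry.path)
            else:
                collected.append(entry)

    _walk(root_dir)
    df = spark.createDataFrame(collected)
    return (df
            .select('path', 'name', (col('size') / (1024 ** 3)).alias('size_gb'))
            .orderBy(desc('size_gb')))

# Example usage:
# sizes_df = get_file_sizes_df('abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data/root')
# display(sizes_df)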

