Hello, I am running a job that reads in files of different sizes, each one representing a different dataset, and loads them into a Delta table. Some files are as large as 100 GiB and others as small as 500 MiB. I want to repartition each file based on its size. I am trying to run the following code to get the size of a file given its path in Azure.
%pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient
def get_file_size(file_path):
    # Extract the container and blob name from the file path
    container_name = <container_name>  # from file_path
    blob_name = <blob_name>  # from file_path
    print(container_name)
    print(blob_name)

    # Create a BlobServiceClient
    connection_string = <connection_string>
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    # Get the BlobClient
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # Get the blob properties and return the size in bytes
    blob_properties = blob_client.get_blob_properties()
    return blob_properties.size
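In case it matters, the way I pull the container and blob name out of the path looks roughly like this (the helper name and the assumption that paths are abfss:// URIs are mine; adjust if your paths use a different scheme):

```python
from urllib.parse import urlparse

def parse_abfss_path(file_path):
    """Split an abfss:// URI into (container, blob) parts.

    Assumes paths of the form:
    abfss://<container>@<account>.dfs.core.windows.net/<blob path>
    """
    parsed = urlparse(file_path)
    container_name = parsed.netloc.split("@")[0]  # container comes before the '@'
    blob_name = parsed.path.lstrip("/")           # everything after the host
    return container_name, blob_name
```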
This code doesn't execute to completion; it errors out at "blob_properties = blob_client.get_blob_properties()" with a ServiceRequestError that says, "Failed to establish a new connection: [Errno -2] Name or service not known."
Does anybody know where to go from here if I'm trying to retrieve the file size without having to manually go into the UI?
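For context, once I have the byte size, the size-based repartitioning I have in mind is just picking a partition count from it, roughly like this (the 128 MiB target per partition is my own assumption, not a fixed rule):

```python
import math

def partitions_for_size(size_bytes, target_bytes=128 * 1024 * 1024):
    """Choose a partition count so each partition holds roughly target_bytes.

    128 MiB per partition is an assumed default; tune it for your cluster.
    """
    return max(1, math.ceil(size_bytes / target_bytes))

# intended use (sketch): df.repartition(partitions_for_size(get_file_size(path)))
```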