- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
2 weeks ago
Hi , While DESCRIBE DETAIL gives you the total sizeInBytes, it lumps everything together. Neither ANALYZE nor the Information Schema will break down the exact physical storage overhead of your Change Data Feed (CDF) or your Delta logs.
Because a Delta table is ultimately just a directory on your cloud storage, the most accurate way to do this is programmatically using Python and the Databricks SDK.
The base data, the transaction log, and the CDF data all live in separate folders (_delta_log and _change_data). You can use the SDK to cleanly fetch the table's storage location, and then sum up the sizes of those specific folders.
Here is a handy Python script you can copy and paste into a notebook to do exactly this:
from databricks.sdk import WorkspaceClient
def get_delta_storage_breakdown(catalog, schema, table_name):
# 1. Initialize the SDK client
w = WorkspaceClient()
# 2. Get the exact storage location using the SDK
full_table_name = f"{catalog}.{schema}.{table_name}"
table_info = w.tables.get(full_name=full_table_name)
base_location = table_info.storage_location
# Helper function to sum directory sizes
def get_dir_size(path):
total_size = 0
try:
for file in dbutils.fs.ls(path):
if file.isDir():
total_size += get_dir_size(file.path)
else:
total_size += file.size
except Exception as e:
pass # Directory might not exist if CDF was never enabled
return total_size
# 3. Calculate sizes for the specific Delta folders
log_size = get_dir_size(f"{base_location}/_delta_log/")
cdf_size = get_dir_size(f"{base_location}/_change_data/")
# Calculate the total table directory size, then subtract log and CDF to get base data
total_size = get_dir_size(base_location)
base_data_size = total_size - log_size - cdf_size
# 4. Print the breakdown (converting bytes to MB for readability)
print(f"--- Size Breakdown for {full_table_name} ---")
print(f"Base Data Size: {base_data_size / (1024 * 1024):.2f} MB")
print(f"Delta Log Size: {log_size / (1024 * 1024):.2f} MB")
print(f"CDF Overhead: {cdf_size / (1024 * 1024):.2f} MB")
print(f"Total Size: {total_size / (1024 * 1024):.2f} MB")
# Give it a try!
get_delta_storage_breakdown("your_catalog", "your_schema", "your_table")