Sizing Tables and delt logs/CDF

RGSLCA · Wednesday

Hi,

I need to compare the sizes of my delta tables , what's the correct approach ?

Table size reported by analyze command ? , but how do I check the delta log size , if I enable CDF .. how do I know the CDF log size(the overhead it adds) ? , kind of like how to find with and without CDF ?

Is there a dictionary table in information schema or a command ?

naveen0808 · Wednesday

DESCRIBE DETAIL is the go-to for data size but it only tells you half the story. The delta log and CDF overhead are invisible to it they sit in separate folders on storage and you have to check them directly.Think of it as three buckets for any Delta table. First is your actual data. Second is the delta log which grows quietly over time especially on high-churn tables. Third is the change data folder which only appears when CDF is enabled and captures every insert, update and delete separately.To know your real CDF overhead just compare the change data folder size against your main table size — that ratio tells you exactly what CDF is costing you in storage terms. A table with heavy updates will have a much higher CDF overhead than a mostly append-only table.

So the honest answer is: DESCRIBE DETAIL for data size, storage path inspection for everything else, and SHOW TBLPROPERTIES to confirm which tables actually have CDF switched on before you start comparing.

Ashwin_DSA · Thursday

Hi @RGSLCA,

From what I can see, there are three different things you might want to measure, and Databricks exposes them differently.

If you want the size of the current Delta table itself, the right starting point is DESCRIBE DETAIL. The sizeInBytes value is the size of the table's latest snapshot, so that is the number you can use when comparing one Delta table to another at the table level. It is not the same thing as.... all storage consumed by the table folder, including metadata history.

Try this...

DESCRIBE DETAIL catalog.schema.table_name;

If you want to understand how much data each write operation produced over time, then DESCRIBE HISTORY is useful. The operationMetrics map includes values such as numOutputBytes for write operations, which is handy for understanding growth and write amplification, but it is still not a direct "current delta log size" metric.

If you specifically want the size of the Delta log, there is no built-in SQL command or information_schema view that returns the value directly as "_delta_log bytes". The Delta log lives alongside the table data as JSON commit files and Parquet checkpoints, and the way to measure it is to get the table's storage_path and then sum the bytes under storage_path/_delta_log yourself. INFORMATION_SCHEMA.TABLES can help here because it exposes storage_path, but not log size or CDF size.

Try this to get the storage path...

SELECT table_catalog, table_schema, table_name, storage_path
FROM system.information_schema.tables
WHERE table_catalog = 'catalog'
  AND table_schema = 'schema'
  AND table_name = 'table_name';

CDF works similarly. There is no separate dictionary table or single command that tells you "this much of the table is CDF overhead". The public docs explicitly state that enabling the change data feed can cause a small storage increase because some changes may be written to separate change data files, but some operations do not generate change files at all, and Databricks computes the feed directly from the transaction log instead. So the overhead is workload-dependent rather than a fixed percentage.

Because of that, an approach to "with and without CDF" is to compare equivalent tables or equivalent workloads in an A/B setup. Compare DESCRIBE DETAIL.sizeInBytes for the snapshot view, compare total bytes under the table root, and compare bytes under _delta_log. That gives you the table-level difference, the metadata/log difference, and the overall storage difference. Also, one small but important caveat from the docs... you should not try to reconstruct CDF semantics by reading the underlying change files directly. Use the Delta APIs for reading CDF, and only use storage inspection for estimating footprint.

I'm curious to know what you aim to achieve with this exercise. Is this for reconciliation or for estimating sizes as you transition between different environments?

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

ShamenParis · Thursday

Hi , While DESCRIBE DETAIL gives you the total sizeInBytes, it lumps everything together. Neither ANALYZE nor the Information Schema will break down the exact physical storage overhead of your Change Data Feed (CDF) or your Delta logs.

Because a Delta table is ultimately just a directory on your cloud storage, the most accurate way to do this is programmatically using Python and the Databricks SDK.

The base data, the transaction log, and the CDF data all live in separate folders (_delta_log and _change_data). You can use the SDK to cleanly fetch the table's storage location, and then sum up the sizes of those specific folders.

Here is a handy Python script you can copy and paste into a notebook to do exactly this:

from databricks.sdk import WorkspaceClient

def get_delta_storage_breakdown(catalog, schema, table_name):
    # 1. Initialize the SDK client
    w = WorkspaceClient()
    
    # 2. Get the exact storage location using the SDK
    full_table_name = f"{catalog}.{schema}.{table_name}"
    table_info = w.tables.get(full_name=full_table_name)
    base_location = table_info.storage_location
    
    # Helper function to sum directory sizes
    def get_dir_size(path):
        total_size = 0
        try:
            for file in dbutils.fs.ls(path):
                if file.isDir():
                    total_size += get_dir_size(file.path)
                else:
                    total_size += file.size
        except Exception as e:
            pass # Directory might not exist if CDF was never enabled
        return total_size

    # 3. Calculate sizes for the specific Delta folders
    log_size = get_dir_size(f"{base_location}/_delta_log/")
    cdf_size = get_dir_size(f"{base_location}/_change_data/")
    
    # Calculate the total table directory size, then subtract log and CDF to get base data
    total_size = get_dir_size(base_location)
    base_data_size = total_size - log_size - cdf_size
    
    # 4. Print the breakdown (converting bytes to MB for readability)
    print(f"--- Size Breakdown for {full_table_name} ---")
    print(f"Base Data Size: {base_data_size / (1024 * 1024):.2f} MB")
    print(f"Delta Log Size: {log_size / (1024 * 1024):.2f} MB")
    print(f"CDF Overhead:   {cdf_size / (1024 * 1024):.2f} MB")
    print(f"Total Size:     {total_size / (1024 * 1024):.2f} MB")

# Give it a try!
get_delta_storage_breakdown("your_catalog", "your_schema", "your_table")

Vikram10 · Friday

Hi @RGSLCA
DESCRIBE DETAIL is the best starting point if you're comparing Delta table sizes, but it's important to understand what it reports. The sizeInBytes value represents only the latest active snapshot of the table, not the total storage consumed by the table on cloud storage.

If you're looking at the backing storage (ADLS, S3, GCS), the actual footprint can be significantly larger because it may also contain:

Historical Parquet files retained for time travel
Tombstoned files awaiting VACUUM
Delta transaction log files (_delta_log)
Change Data Feed files (_change_data) when CDF is enabled

There isn't currently a built-in SQL command or Information Schema view that directly reports Delta log size or CDF overhead. To measure those, you'll need to inspect the table's storage location and calculate the sizes of the _delta_log and _change_data folders separately.

One thing to keep in mind: CDF overhead is workload-dependent. Some changes are materialized as files under _change_data, while others can be derived from the transaction log, so there isn't a simple fixed percentage overhead.

In practice, I usually use DESCRIBE DETAIL for comparing the active table size and then validate the actual storage consumption at the cloud storage level, especially for tables with frequent updates/merges or where VACUUM has not been run recently.