Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @RGSLCA,

From what I can see, there are three different things you might want to measure, and Databricks exposes them differently.

If you want the size of the current Delta table itself, the right starting point is DESCRIBE DETAIL. The sizeInBytes value is the size of the table's latest snapshot, so that is the number you can use when comparing one Delta table to another at the table level. It is not the same thing as.... all storage consumed by the table folder, including metadata history.

Try this... 

DESCRIBE DETAIL catalog.schema.table_name;

If you want to understand how much data each write operation produced over time, then DESCRIBE HISTORY is useful. The operationMetrics map includes values such as numOutputBytes for write operations, which is handy for understanding growth and write amplification, but it is still not a direct "current delta log size" metric.

If you specifically want the size of the Delta log, there is no built-in SQL command or information_schema view that returns the value directly as "_delta_log bytes". The Delta log lives alongside the table data as JSON commit files and Parquet checkpoints, and the way to measure it is to get the table's storage_path and then sum the bytes under storage_path/_delta_log yourself. INFORMATION_SCHEMA.TABLES can help here because it exposes storage_path, but not log size or CDF size.

Try this to get the storage path...

SELECT table_catalog, table_schema, table_name, storage_path
FROM system.information_schema.tables
WHERE table_catalog = 'catalog'
  AND table_schema = 'schema'
  AND table_name = 'table_name';

CDF works similarly. There is no separate dictionary table or single command that tells you "this much of the table is CDF overhead". The public docs explicitly state that enabling the change data feed can cause a small storage increase because some changes may be written to separate change data files, but some operations do not generate change files at all, and Databricks computes the feed directly from the transaction log instead. So the overhead is workload-dependent rather than a fixed percentage.

Because of that, an approach to "with and without CDF" is to compare equivalent tables or equivalent workloads in an A/B setup. Compare DESCRIBE DETAIL.sizeInBytes for the snapshot view, compare total bytes under the table root, and compare bytes under _delta_log. That gives you the table-level difference, the metadata/log difference, and the overall storage difference. Also, one small but important caveat from the docs... you should not try to reconstruct CDF semantics by reading the underlying change files directly. Use the Delta APIs for reading CDF, and only use storage inspection for estimating footprint.

I'm curious to know what you aim to achieve with this exercise. Is this for reconciliation or for estimating sizes as you transition between different environments?

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***