Unity Catalog storage amounts
Hi,
I am using Azure and I have predictive optimization enabled on the catalog. I have written a script that calculates the data size of all of the tables by looping over them and running "describe detail".
All of the tables amount to ~1.5 TB, but running the selection statistics on the catalog using Azure Storage Explorer returns around 6 TB.
What might be causing this, and how can I drill down into the storage amounts?
I do have quite a few big materialized views (some with billions of rows), and the vacuum retention is 7 days.
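For context, a simplified sketch of that kind of per-table loop might look like the following (PySpark in a notebook; the catalog name is a placeholder, not the exact script):

```python
# Simplified sketch: sum DESCRIBE DETAIL sizeInBytes for every table in a catalog.
# "my_catalog" is a placeholder; views and non-Delta objects are skipped.
catalog = "my_catalog"
total_bytes = 0

schemas = [row.databaseName for row in spark.sql(f"SHOW SCHEMAS IN {catalog}").collect()]
for schema in schemas:
    for t in spark.sql(f"SHOW TABLES IN {catalog}.{schema}").collect():
        full_name = f"{catalog}.{schema}.{t.tableName}"
        try:
            detail = spark.sql(f"DESCRIBE DETAIL {full_name}").collect()[0]
            size = detail["sizeInBytes"] or 0
            total_bytes += size
            print(f"{full_name}: {size / 1024**3:.2f} GiB")
        except Exception as e:
            # DESCRIBE DETAIL fails on views and materialized views, so they are skipped here
            print(f"Skipped {full_name}: {e}")

print(f"Total active table size: {total_bytes / 1024**4:.2f} TiB")
```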
Hey @notwarte,
Maybe I can help you.
There are several likely factors contributing to this:
Predictive Optimization runs background jobs (OPTIMIZE, VACUUM, ANALYZE) that rewrite data files to improve query performance. The optimized files coexist with the older files until vacuum cleans them up, so for a period of time both the new and the old data files are stored, increasing storage usage.
The DESCRIBE DETAIL command only reflects the current active data files. It does not include old files that have been replaced or marked for deletion. But Azure Storage Explorer shows everything that exists physically in the storage, including older data files, transaction logs, and temporary files.
Materialized views are also stored as physical Delta tables. They may not be included in your script if it only loops over regular tables. Since these views are refreshed automatically and can contain a lot of data, they could be taking up a substantial amount of space.
Temporary or intermediate files left behind from operations like OPTIMIZE, MERGE, or COPY INTO can also accumulate if they are not cleaned up, especially if jobs fail or get interrupted.
To investigate further, I suggest:
- Make sure your script includes materialized views in addition to regular tables.
- Use Azure Storage Explorer to sort folders by size or modification date. This will help you identify which datasets are generating the most residual data.
- Run VACUUM with the DRY RUN option on key tables to see how many stale files are still there (see the sketch after this list).
- Consider lowering the vacuum retention period if your use case allows it, especially for fast-changing tables.
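For example, a dry-run check could look roughly like this (the table name is a placeholder; RETAIN 168 HOURS matches your 7-day retention):

```python
# Sketch: list the files VACUUM would delete, without actually removing anything.
# The table name below is a placeholder.
table_name = "my_catalog.my_schema.my_table"

# DRY RUN returns the files that would be removed at the given retention threshold
stale_files = spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS DRY RUN")
stale_files.show(truncate=False)
print(f"Stale files found: {stale_files.count()}")
```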
Best, 🙂
Isi
Hi @Isi ,
thank you for your answer. I cannot run "describe detail" on materialized views, which is why they were not included in the script. Is there another way to find out their size? Most tables are append-only, so there are not that many stale files.
I have used the __databricks_internal catalog to find the underlying location of the materialized views. It turns out they are the ones taking up so much space. I suspected it might be them, so thank you for the tip, @Isi.
I did some more digging: the silver_raw table, which is a streaming DLT table, weighs 30 GB, while the silver_publish materialized view, which is based on that silver_raw table, weighs almost 1 TB. I would have guessed Databricks is smart enough not to store the same data twice, or am I missing something?
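In case it helps anyone else, a rough way to size one of these locations is to walk the path with dbutils.fs and sum the file sizes (the path below is just a placeholder):

```python
# Sketch: recursively sum file sizes under a storage path.
# Replace the placeholder path with the materialized view's backing location.
def dir_size_bytes(path: str) -> int:
    total = 0
    for entry in dbutils.fs.ls(path):
        if entry.name.endswith("/"):   # directories are listed with a trailing slash
            total += dir_size_bytes(entry.path)
        else:
            total += entry.size
    return total

mv_path = "abfss://<container>@<account>.dfs.core.windows.net/<path-to-mv>"  # placeholder
print(f"{dir_size_bytes(mv_path) / 1024**3:.2f} GiB")
```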
Hey @notwarte,
Using the __databricks_internal catalog to trace the underlying storage location is a solid approach for investigating their footprint.
Regarding your question about storage duplication: yes, materialized views in Databricks do store a physical representation of the data independently from the source tables. Even if a materialized view is built on top of a Delta Live Table (like your silver_raw), Databricks does not re-use the same storage files. Instead, it maintains its own optimized version of the result set defined by the materialized view.
So if your silver_raw table is 30 GB and the silver_publish materialized view weighs almost 1 TB, that suggests one of the following:
- The transformation logic in the view significantly expands the data (e.g., joins, explodes, denormalization).
- The view is accumulating data due to the way it is refreshed or retained.
Keep in mind that materialized views are optimized for query performance, not necessarily for storage efficiency. If the view is auto-refreshed and not using a proper partitioning or cleanup mechanism, it may grow continuously.
Recommendations:
- Review whether the materialized view is necessary, or whether the same goal could be achieved via an incremental table maintained with DLT or MERGE (see the sketch after this list).
- If you keep it, make sure it is partitioned appropriately and has some lifecycle or retention management in place.
- Monitor the storage usage and refresh logic regularly, especially if you’re on S3 or ADLS where storage costs matter.
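A minimal sketch of the MERGE-based alternative from the first point (the catalog/schema prefix, the id key, and the incremental window are assumptions, not your actual schema):

```python
# Sketch: incremental upsert into a target table instead of a fully recomputed materialized view.
# Catalog/schema names, the "id" key, and the date filter are placeholders.
spark.sql("""
  MERGE INTO my_catalog.my_schema.silver_publish AS target
  USING (
    SELECT *
    FROM my_catalog.my_schema.silver_raw
    WHERE ingest_date >= current_date() - INTERVAL 1 DAY  -- only the latest increment
  ) AS source
  ON target.id = source.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```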
If you believe this answer is correct, please mark it as the solution for future users.
Hope this helps 🙂
Isi

