Databricks Community

Devsql · ‎05-20-2024

Hi Team,

Currently I am trying to find size of all tables in my Azure databricks, as i am trying to get idea of current data loading trends, so i can plan for data forecast ( i.e. Last 2 months, approx 100 GB data came-in, so in next 2-3 months there should be 150 GB data coming-in)

My production Azure databricks environment is using Unity Catalog, which hosts:

a- All Bronze Tables

b- All Silver Tables

c- All Gold Tables.

d- Some extra Delta-Live-Tables, acting as Temp table holding results of intermediate calculation.

e- Some tables made via EXCEL sheet data.

Above tables are Delta-Live-Tables, made via DLT based pipelines/Jobs.

So i am looking for a script/code/solution which gives me total size in GB for all tables in given database.

Solution based on SQL is good one but even having answer based on Python/scala would be ok.

Also in tradition Relation-DBMS world, there used to be several built-in reports which gives idea of data loading trends via Charts or some graphs...So do we have such built-in feature with Azure Databricks ?

Thanks in Advance

Devsql

Devsql · ‎06-21-2024

Hi @Retired_mod,

1- Regarding this issue i had found below link:

https://kb.databricks.com/sql/find-size-of-table#:~:text=You%20can%20determine%20the%20size,stats%20...

Now to try above link, I need to decide: Delta-Table Vs Non-Delta-Table.

Few of my tables are Materialized Views while some are Streaming Tables. So I am Not able to understand about which one to try: Delta-Table Vs Non-Delta-Table.

Can you please help me in that ?

2- Also in above link, for Delta-Table, for below line of code:

DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")

I do Not know path: "dbfs:/<path-to-delta-table>".

From where I can get this path, UI Or Command ?

As i am working for corporate company, Admin have given only necessary (minimum) access. So if you guide me than I can ask Admin for permission.

3- When I opened your above given link and tried your solution, I got below error:

Cannot query a Streaming Table `DB`.`Schema`.`Table_append_raw` from an Assigned or No isolation shared cluster, please use a SHARED cluster or a Databricks SQL warehouse instead.

So from above I understood that we need to use SHARED cluster. Is there any other option we have ?

Thank you

Devsql