cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Measure size of all tables in Azure databricks

Devsql
New Contributor III

Hi Team,

Currently I am trying to find size of all tables in my Azure databricks, as i am trying to get idea of current data loading trends, so i can plan for data forecast ( i.e. Last 2 months, approx 100 GB data came-in, so in next 2-3 months there should be 150 GB data coming-in)

My production Azure databricks environment is using Unity Catalog, which hosts:

a- All Bronze Tables

b- All Silver Tables

c- All Gold Tables.

d- Some extra Delta-Live-Tables, acting as Temp table holding results of intermediate calculation.

e- Some tables made via EXCEL sheet data.

Above tables are Delta-Live-Tables, made via DLT based pipelines/Jobs.

So i am looking for a script/code/solution which gives me total size in GB for all tables in given database.

Solution based on SQL is good one but even having answer based on Python/scala would be ok.

Also in tradition Relation-DBMS world, there used to be several built-in reports which gives idea of data loading trends via Charts or some graphs...So do we have such built-in feature with Azure Databricks ?

Thanks in Advance

Devsql

2 REPLIES 2

Kaniz
Community Manager
Community Manager

Hi @Devsql

  1. For delta tables, you can use Apache Spark™ SQL commands.
  2. To determine the size of non-delta tables, calculate the total sum of the individual files within the underlying directory. Alternatively, you can use queryExecution.analyzed.stats to get the size.
  3. Azure Databricks doesn’t have built-in reports or charts specifically for data-loading trends. However, you can create custom visualizations using tools like Databricks Notebooks, which allow you to plot data and analyze trends over time. You can use libraries like Matplotlib or ggplot in Python or Scala to create graphs based on your data.

    If you have any further questions or need additional assistance, feel free to ask! 

Devsql
New Contributor III

Hi @Kaniz,

1-  Regarding this issue i had found below link:

https://kb.databricks.com/sql/find-size-of-table#:~:text=You%20can%20determine%20the%20size,stats%20...

Now to try above link, I need to decide: Delta-Table Vs Non-Delta-Table.

Few of my tables are Materialized Views while some are Streaming Tables. So I am Not able to understand about which one to try: Delta-Table Vs Non-Delta-Table.

Can you please help me in that ?

2-  Also in above link, for Delta-Table, for below line of code:

DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")

I do Not know path: "dbfs:/<path-to-delta-table>".

From where I can get this path, UI Or Command ?

As i am working for corporate company, Admin have given only necessary (minimum) access. So if you guide me than I can ask Admin for permission.

3- When I opened your above given link and tried your solution, I got below error:

Cannot query a Streaming Table `DB`.`Schema`.`Table_append_raw` from an Assigned or No isolation shared cluster, please use a SHARED cluster or a Databricks SQL warehouse instead.

So from above I understood that we need to use SHARED cluster. Is there any other option we have ?

Thank you

Devsql

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!