Databricks Community

Volkan_Gumuskay · ‎07-19-2023

I am trying to list all delta tables in a database and retrieve the following columns: `totalsizeinbyte`, `sizeinbyte` (i.e. the size of last snap shot size) and `created_by` (`lastmodified_by` could also work). Checking online I came across the following post, where you can almost achieve this task without the user information in `scala`: https://stackoverflow.com/a/73893361/6903605 .

The problem is I would like to get results using `python` (`sql` could also work).

Attempt #1 : ``scala solution using the post above retrieves `totalsizeinbyte`, `sizeinbyte` but not `created_by`. Also this is scala, not python.

%scala
import com.databricks.sql.transaction.tahoe._
 
val databasePath = "dbfs:/user/hive/databasename.db"
 
def size(path: String): Long =
  dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum
 
val tables = dbutils.fs.ls(databasePath).par.map { fi =>
  val totalSize = size(fi.path)
  val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes
  (fi.name, totalSize, snapshotSize)
}
display(tables.seq.sorted.toDF("name", "total_size_in_byte","snapshotSize_in_byte"))
// df.write.toTable("<table-name>")

Attempt #2:

Looping `sql` query using `python`. Here, total size and created_by is missing.

from functools import reduce
from pyspark.sql import DataFrame

db_name = 'databasename'

#Create initial df of all tables in a database
tbl_lst= spark.sql("SHOW TABLES IN {}".format(db_name))

tbl_lst.createOrReplaceTempView("tbl_lst")
#Create array of all database tables
table_array= spark.sql("select collect_list(tableName) from tbl_lst where isTemporary == 'false'").collect()[0][0]
#For loop to get describe detail for each table in the array
sql_lst = [f"DESCRIBE DETAIL {db_name}.{table}" for table in table_array]

all_tbls=[]
success=0
fail=0
for sql in sql_lst:
  try:
    all_tbls.append(spark.sql(sql))
    success=success+1
  except:
    print('Error in:',sql)
    fail=fail+1

#Union multiple dataframe into one
tbl_details = reduce(DataFrame.unionAll, all_tbls)
tbl_details.createOrReplaceTempView("db_tbls_detail")
display(tbl_details)

Databricks Community

List all delta tables in a database with total size, last snapshot size and user using python/sql

🔔 ALERT: Act Now to Protect Your Community Account; Secure Your Details Before It's Too Late!

Databricks Learning Festival (Virtual): 10 July - 24 July 2024

Data + AI Summit 2024: An Executive Summary for Data Leaders

Big Data Is Back and Is More Important Than AI