I am trying to list all delta tables in a database and retrieve the following columns: `totalsizeinbyte`, `sizeinbyte` (i.e. the size of last snap shot size) and `created_by` (`lastmodified_by` could also work). Checking online I came across the following post, where you can almost achieve this task without the user information in `scala`: https://stackoverflow.com/a/73893361/6903605 .
The problem is I would like to get results using `python` (`sql` could also work).
Attempt #1 : ``scala solution using the post above retrieves `totalsizeinbyte`, `sizeinbyte` but not `created_by`. Also this is scala, not python.
%scala
import com.databricks.sql.transaction.tahoe._
val databasePath = "dbfs:/user/hive/databasename.db"
def size(path: String): Long =
dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum
val tables = dbutils.fs.ls(databasePath).par.map { fi =>
val totalSize = size(fi.path)
val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes
(fi.name, totalSize, snapshotSize)
}
display(tables.seq.sorted.toDF("name", "total_size_in_byte","snapshotSize_in_byte"))
// df.write.toTable("<table-name>")
Attempt #2:
Looping `sql` query using `python`. Here, total size and created_by is missing.
from functools import reduce
from pyspark.sql import DataFrame
db_name = 'databasename'
#Create initial df of all tables in a database
tbl_lst= spark.sql("SHOW TABLES IN {}".format(db_name))
tbl_lst.createOrReplaceTempView("tbl_lst")
#Create array of all database tables
table_array= spark.sql("select collect_list(tableName) from tbl_lst where isTemporary == 'false'").collect()[0][0]
#For loop to get describe detail for each table in the array
sql_lst = [f"DESCRIBE DETAIL {db_name}.{table}" for table in table_array]
all_tbls=[]
success=0
fail=0
for sql in sql_lst:
try:
all_tbls.append(spark.sql(sql))
success=success+1
except:
print('Error in:',sql)
fail=fail+1
#Union multiple dataframe into one
tbl_details = reduce(DataFrame.unionAll, all_tbls)
tbl_details.createOrReplaceTempView("db_tbls_detail")
display(tbl_details)