JVM Heap Leak When Iterating Over Large Number of Tables Using DESCRIBE DETAIL
Problem:
I'm trying to generate a consolidated metadata table for all tables within a Databricks database (I do not have admin privileges). The process works fine for the first few thousand tables, but as it progresses, the driver node eventually crashes with a JVM heap memory issue.
Goal:
Create a table (e.g., metadata_table) in my_database that stores metadata for each table using DESCRIBE DETAIL.
Additionally, I would like to do the same for historical metadata using DESCRIBE HISTORY, saved to a similar table (e.g., metadata_history_table). I suspect I’ll encounter similar performance and memory issues with that as well.
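To make that second goal concrete, here is a minimal sketch of the DESCRIBE HISTORY variant I have in mind (same pattern as the DESCRIBE DETAIL loop below, writing to the placeholder metadata_history_table):
from pyspark.sql import functions as f

database = 'my_database'
table_names = [row.tableName for row in spark.sql(f"SHOW TABLES IN {database}").collect()]

for table in table_names:
    try:
        # DESCRIBE HISTORY returns one row per commit in the Delta transaction log
        hist_df = spark.sql(f"DESCRIBE HISTORY {database}.{table}") \
            .withColumn("database", f.lit(database)) \
            .withColumn("table_name", f.lit(table))
        hist_df.write.mode('append').saveAsTable(f'{database}.metadata_history_table')
    except Exception as e:
        # Non-Delta tables will raise here, since DESCRIBE HISTORY is Delta-only
        print(f"Error with {table}: {e}")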
Current Code (for DESCRIBE DETAIL):
from pyspark.sql import functions as f

database = 'my_database'

# List every table in the database and collect the names on the driver
tables_df = spark.sql(f"SHOW TABLES IN {database}")
table_names = [row.tableName for row in tables_df.collect()]

for i, table in enumerate(table_names, start=1):
    try:
        # DESCRIBE DETAIL returns a single-row DataFrame of table-level metadata
        df = spark.sql(f"DESCRIBE DETAIL {database}.{table}")
        df = df.withColumn("database", f.lit(database)).withColumn("table_name", f.lit(table))
        df.write.mode('append').saveAsTable('my_database.metadata_table')
    except Exception as e:
        print(f"Error with {table}: {e}")
Context:
- The database contains ~10,000 tables.
- Processing proceeds fine until about the 3000th table.
- After that, performance degrades rapidly and the notebook eventually fails with:
Internal error. Attach your notebook to a different compute or restart the current compute.
com.databricks.backend.daemon.driver.DriverClientDestroyedException: abort: Driver Client destroyed
Observations:
- Spark UI shows red "Task Time" for the driver with high "GC Time".
- Storage Memory does not appear to be under pressure (see the attached screenshot of the Spark UI Executors tab).
- JVM heap usage increases steadily with each iteration:
# Logging JVM heap usage on the driver
def log_jvm_heap():
    rt = spark._jvm.java.lang.Runtime.getRuntime()
    used = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
    total = rt.maxMemory() / (1024 * 1024)
    print(f"[Heap] {used:.2f} MB / {total:.2f} MB")
Sample output:
[Heap] 1328.19 MB / 43215.00 MB # iteration 100
[Heap] 2215.11 MB / 43526.50 MB # iteration 200
...
[Heap] 18718.88 MB / 43526.00 MB # iteration 1600
[Heap] 20008.39 MB / 43495.50 MB # iteration 1700
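These numbers come from calling the helper inside the loop, roughly like this (a sketch; the 100-iteration interval just matches the spacing of the sample output):
for i, table in enumerate(table_names, start=1):
    ...  # DESCRIBE DETAIL + append, as in the loop above
    if i % 100 == 0:
        log_jvm_heap()  # sample driver heap usage every 100 tables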
What I’ve Tried:
- Explicit memory cleanup after each iteration:
import gc

df = None
del df
gc.collect()
spark._jvm.java.lang.System.gc()
- Batching the write every 100 tables to reduce write frequency (a simplified sketch of what I mean is included after this list)
Unfortunately, neither attempt prevents the memory usage from growing.
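For illustration, the batching idea looked roughly like this (a simplified sketch, not my exact code; BATCH_SIZE and the reduce/unionByName combination are just one way to express "accumulate the one-row DESCRIBE DETAIL DataFrames and append them in a single write per batch"):
from functools import reduce
from pyspark.sql import functions as f

BATCH_SIZE = 100
batch = []

for table in table_names:
    try:
        df = spark.sql(f"DESCRIBE DETAIL {database}.{table}") \
            .withColumn("database", f.lit(database)) \
            .withColumn("table_name", f.lit(table))
        batch.append(df)
    except Exception as e:
        print(f"Error with {table}: {e}")

    if len(batch) >= BATCH_SIZE:
        # One append for every 100 tables instead of one write per table
        reduce(lambda a, b: a.unionByName(b), batch) \
            .write.mode('append').saveAsTable('my_database.metadata_table')
        batch = []

# Flush whatever is left after the last full batch
if batch:
    reduce(lambda a, b: a.unionByName(b), batch) \
        .write.mode('append').saveAsTable('my_database.metadata_table')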
Questions:
- Is there a reliable way to force JVM heap cleanup in this context?
- Is this behavior a known limitation when using DESCRIBE DETAIL or DESCRIBE HISTORY iteratively at scale?
- Are there more memory-efficient or scalable alternatives to gather metadata across many tables without requiring admin privileges?
Any advice or guidance is much appreciated! Thank you!
- Labels:
  - Delta Lake
  - Spark

