<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>JVM Heap Leak When Iterating Over Large Number of Tables Using DESCRIBE DETAIL in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/jvm-heap-leak-when-iterating-over-large-number-of-tables-using/m-p/114752#M44933</link>
    <description>Forum topic: JVM heap usage on the driver grows steadily while iterating DESCRIBE DETAIL over ~10,000 tables, eventually crashing the compute. Full post and replies in the items below.</description>
    <pubDate>Mon, 07 Apr 2025 19:24:37 GMT</pubDate>
    <dc:creator>unnamedchunk</dc:creator>
    <dc:date>2025-04-07T19:24:37Z</dc:date>
    <item>
      <title>JVM Heap Leak When Iterating Over Large Number of Tables Using DESCRIBE DETAIL</title>
      <link>https://community.databricks.com/t5/data-engineering/jvm-heap-leak-when-iterating-over-large-number-of-tables-using/m-p/114752#M44933</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Problem:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I'm trying to generate a consolidated metadata table for all tables within a Databricks database (I do not have admin privileges). The process works fine for the first few thousand tables, but as it progresses, the driver node eventually crashes with a JVM heap memory issue.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Goal:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Create a table (e.g., metadata_table) in my_database that stores metadata for each table using DESCRIBE DETAIL.&lt;/P&gt;&lt;P&gt;Additionally, I would like to do the same for historical metadata using DESCRIBE HISTORY, saved to a similar table (e.g., metadata_history_table). I suspect I’ll encounter similar performance and memory issues with that as well.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Current Code (for DESCRIBE DETAIL):&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;database = 'my_database'
tables_df = spark.sql(f"SHOW TABLES IN {database}")
table_names = [row.tableName for row in tables_df.collect()]

for i, table in enumerate(table_names, start=1):
    try:
        df = spark.sql(f"DESCRIBE DETAIL {database}.{table}")
        df = df.withColumn("database", f.lit(database)).withColumn("table_name", f.lit(table))
        df.write.mode('append').saveAsTable('my_database.metadata_table')
    except Exception as e:
        print(f"Error with {table}: {e}")&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Context:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;The database contains ~10,000 tables.&lt;/LI&gt;&lt;LI&gt;Processing proceeds fine until about the 3000th table.&lt;/LI&gt;&lt;LI&gt;After that, performance degrades rapidly and the notebook eventually fails with:&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;Internal error. Attach your notebook to a different compute or restart the current compute.
com.databricks.backend.daemon.driver.DriverClientDestroyedException: abort: Driver Client destroyed&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Observations:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Spark UI shows red "Task Time" for the driver with high "GC Time".&lt;/LI&gt;&lt;LI&gt;Storage Memory does not appear to be under pressure. Screenshot of Spark UI Executors&lt;span class="lia-inline-image-display-wrapper lia-image-align-center" image-alt="spark_ui.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15850i5D109C669917E209/image-size/large?v=v2&amp;amp;px=999" role="button" title="spark_ui.png" alt="spark_ui.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;JVM heap usage increases steadily with each iteration:&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;# Logging JVM heap usage
def log_jvm_heap():
    rt = spark._jvm.java.lang.Runtime.getRuntime()  # spark._jvm (not spark.jvm) exposes the driver JVM
    used = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
    total = rt.maxMemory() / (1024 * 1024)
    print(f"[Heap] {used:.2f} MB / {total:.2f} MB")&lt;/LI-CODE&gt;&lt;P&gt;Sample output:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;[Heap] 1328.19 MB / 43215.00 MB # iteration 100
[Heap] 2215.11 MB / 43526.50 MB # iteration 200
...
[Heap] 18718.88 MB / 43526.00 MB # iteration 1600
[Heap] 20008.39 MB / 43495.50 MB # iteration 1700&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;What I’ve Tried:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Explicit memory cleanup after each iteration:&lt;/LI&gt;&lt;/UL&gt;&lt;LI-CODE lang="python"&gt;df = None
del df
gc.collect()
spark._jvm.java.lang.System.gc()&lt;/LI-CODE&gt;&lt;UL&gt;&lt;LI&gt;Batching the write every 100 tables to reduce write frequency (a reconstructed sketch of this variant appears at the end of this post)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Unfortunately, neither attempt prevents the memory usage from growing.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Is there a reliable way to force JVM heap cleanup in this context?&lt;/LI&gt;&lt;LI&gt;Is this behavior a known limitation when using DESCRIBE DETAIL or DESCRIBE HISTORY iteratively at scale?&lt;/LI&gt;&lt;LI&gt;Are there more memory-efficient or scalable alternatives to gather metadata across many tables without requiring admin privileges?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Any advice or guidance is much appreciated! Thank you!&lt;/P&gt;
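&lt;P&gt;For reference, the batching variant looked roughly like this (a reconstructed sketch: BATCH_SIZE and the unionByName reduction are illustrative; the other names match the code above):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from functools import reduce
from pyspark.sql import functions as f

BATCH_SIZE = 100  # write once per 100 tables instead of once per table
batch = []
for i, table in enumerate(table_names, start=1):
    try:
        df = (spark.sql(f"DESCRIBE DETAIL {database}.{table}")
                .withColumn("database", f.lit(database))
                .withColumn("table_name", f.lit(table)))
        batch.append(df)
    except Exception as e:
        print(f"Error with {table}: {e}")
    if len(batch) == BATCH_SIZE:
        # DESCRIBE DETAIL returns a uniform schema, so unionByName is safe here
        reduce(lambda a, b: a.unionByName(b), batch).write.mode('append').saveAsTable('my_database.metadata_table')
        batch = []
if batch:  # flush whatever is left after the loop
    reduce(lambda a, b: a.unionByName(b), batch).write.mode('append').saveAsTable('my_database.metadata_table')&lt;/LI-CODE&gt;</description>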
      <pubDate>Mon, 07 Apr 2025 19:24:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jvm-heap-leak-when-iterating-over-large-number-of-tables-using/m-p/114752#M44933</guid>
      <dc:creator>unnamedchunk</dc:creator>
      <dc:date>2025-04-07T19:24:37Z</dc:date>
    </item>
    <item>
      <title>Re: JVM Heap Leak When Iterating Over Large Number of Tables Using DESCRIBE DETAIL</title>
      <link>https://community.databricks.com/t5/data-engineering/jvm-heap-leak-when-iterating-over-large-number-of-tables-using/m-p/119690#M45945</link>
      <description>&lt;P&gt;It's best to iterate over &lt;A href="https://docs.databricks.com/aws/en/sql/language-manual/information-schema/tables" target="_self"&gt;information_schema's TABLES table&lt;/A&gt; instead of listing the tables yourself.&lt;/P&gt;
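&lt;P&gt;For example, a minimal sketch (assuming a Unity Catalog workspace; the catalog name my_catalog is a placeholder and the column list is abbreviated, see the linked docs for the full set):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# One set-based query replaces SHOW TABLES plus a per-table DESCRIBE loop
metadata_df = spark.sql("""
    SELECT table_catalog, table_schema, table_name,
           table_type, data_source_format, created, last_altered
    FROM my_catalog.information_schema.tables
    WHERE table_schema = 'my_database'
""")
metadata_df.write.mode('overwrite').saveAsTable('my_database.metadata_table')&lt;/LI-CODE&gt;</description>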
      <pubDate>Tue, 20 May 2025 05:47:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/jvm-heap-leak-when-iterating-over-large-number-of-tables-using/m-p/119690#M45945</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2025-05-20T05:47:36Z</dc:date>
    </item>
  </channel>
</rss>

