JVM Heap Leak When Iterating Over Large Number of Tables Using DESCRIBE DETAIL
Problem:
I'm trying to generate a consolidated metadata table for all tables within a Databricks database (I do not have admin privileges). The process works fine for the first few thousand tables, but as it progresses, the driver node eventually crashes with a JVM heap memory issue.
Goal:
Create a table (e.g., metadata_table) in my_database that stores metadata for each table using DESCRIBE DETAIL.
Additionally, I would like to do the same for historical metadata using DESCRIBE HISTORY, saved to a similar table (e.g., metadata_history_table). I suspect I’ll encounter similar performance and memory issues with that as well.
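To make that second goal concrete, here is a minimal sketch of the DESCRIBE HISTORY variant I have in mind (same pattern as the DESCRIBE DETAIL loop below, writing to the placeholder metadata_history_table):
from pyspark.sql import functions as f

database = 'my_database'
table_names = [row.tableName for row in spark.sql(f"SHOW TABLES IN {database}").collect()]

for table in table_names:
    try:
        # DESCRIBE HISTORY returns one row per commit in the Delta transaction log
        hist_df = spark.sql(f"DESCRIBE HISTORY {database}.{table}") \
            .withColumn("database", f.lit(database)) \
            .withColumn("table_name", f.lit(table))
        hist_df.write.mode('append').saveAsTable(f'{database}.metadata_history_table')
    except Exception as e:
        # Non-Delta tables will raise here, since DESCRIBE HISTORY is Delta-only
        print(f"Error with {table}: {e}")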
Current Code (for DESCRIBE DETAIL):
from pyspark.sql import functions as f

database = 'my_database'

# List every table in the database and collect the names on the driver
tables_df = spark.sql(f"SHOW TABLES IN {database}")
table_names = [row.tableName for row in tables_df.collect()]

for i, table in enumerate(table_names, start=1):
    try:
        # DESCRIBE DETAIL returns a single-row DataFrame of table-level metadata
        df = spark.sql(f"DESCRIBE DETAIL {database}.{table}")
        df = df.withColumn("database", f.lit(database)).withColumn("table_name", f.lit(table))
        df.write.mode('append').saveAsTable('my_database.metadata_table')
    except Exception as e:
        print(f"Error with {table}: {e}")
Context:
- The database contains ~10,000 tables.
- Processing proceeds fine until about the 3000th table.
- After that, performance degrades rapidly and the notebook eventually fails with:
Internal error. Attach your notebook to a different compute or restart the current compute.
com.databricks.backend.daemon.driver.DriverClientDestroyedException: abort: Driver Client destroyed
Observations:
- Spark UI shows red "Task Time" for the driver with high "GC Time".
- Storage Memory does not appear to be under pressure (see the attached screenshot of the Spark UI Executors tab).
- JVM heap usage increases steadily with each iteration:
# Logging JVM heap usage on the driver
def log_jvm_heap():
    rt = spark._jvm.java.lang.Runtime.getRuntime()
    used = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024)
    total = rt.maxMemory() / (1024 * 1024)
    print(f"[Heap] {used:.2f} MB / {total:.2f} MB")
Sample output:
[Heap] 1328.19 MB / 43215.00 MB # iteration 100
[Heap] 2215.11 MB / 43526.50 MB # iteration 200
...
[Heap] 18718.88 MB / 43526.00 MB # iteration 1600
[Heap] 20008.39 MB / 43495.50 MB # iteration 1700
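These numbers come from calling the helper inside the loop, roughly like this (a sketch; the 100-iteration interval just matches the spacing of the sample output):
for i, table in enumerate(table_names, start=1):
    ...  # DESCRIBE DETAIL + append, as in the loop above
    if i % 100 == 0:
        log_jvm_heap()  # sample driver heap usage every 100 tables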
What I’ve Tried:
- Explicit memory cleanup after each iteration:
import gc

df = None
del df
gc.collect()
spark._jvm.java.lang.System.gc()
- Batching the write every 100 tables to reduce write frequency (a simplified sketch of what I mean is included after this list)
Unfortunately, neither attempt prevents the memory usage from growing.
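For illustration, the batching idea looked roughly like this (a simplified sketch, not my exact code; BATCH_SIZE and the reduce/unionByName combination are just one way to express "accumulate the one-row DESCRIBE DETAIL DataFrames and append them in a single write per batch"):
from functools import reduce
from pyspark.sql import functions as f

BATCH_SIZE = 100
batch = []

for table in table_names:
    try:
        df = spark.sql(f"DESCRIBE DETAIL {database}.{table}") \
            .withColumn("database", f.lit(database)) \
            .withColumn("table_name", f.lit(table))
        batch.append(df)
    except Exception as e:
        print(f"Error with {table}: {e}")

    if len(batch) >= BATCH_SIZE:
        # One append for every 100 tables instead of one write per table
        reduce(lambda a, b: a.unionByName(b), batch) \
            .write.mode('append').saveAsTable('my_database.metadata_table')
        batch = []

# Flush whatever is left after the last full batch
if batch:
    reduce(lambda a, b: a.unionByName(b), batch) \
        .write.mode('append').saveAsTable('my_database.metadata_table')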
Questions:
- Is there a reliable way to force JVM heap cleanup in this context?
- Is this behavior a known limitation when using DESCRIBE DETAIL or DESCRIBE HISTORY iteratively at scale?
- Are there more memory-efficient or scalable alternatives to gather metadata across many tables without requiring admin privileges?
Any advice or guidance is much appreciated! Thank you!
- Labels:
  - Delta Lake
  - Spark

