Hi @tsam ,
I think your problem is likely caused by the fact that each `CREATE OR REPLACE TABLE ... DEEP CLONE` call accumulates state on the driver, even though you're not collecting data to it.
The main culprits are:
1. **Spark plan / query plan caching.** Every SQL command generates a logical and physical plan that Spark caches in memory. With thousands of Deep Clone commands, these plans pile up and are not garbage collected during the job. Deep Clone plans are particularly heavy because they contain full table metadata, file listings, and schema information for both source and target.
2. **Spark listener event queue.** The Spark UI event log and listeners accumulate `SparkListenerEvent` objects for every completed query (stage info, task metrics, SQL execution details). Thousands of clones mean thousands of events sitting in the driver's heap.
3. **Delta log state.** Each Deep Clone reads the Delta transaction log of the source table. The driver holds on to `DeltaLog` snapshot objects, and Delta's internal log cache can grow very large across thousands of distinct tables.
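As a complementary mitigation for points 1 and 2, you can lower how much UI/listener state the driver retains. This is a sketch using standard Spark retention settings; they must be set before the SparkSession exists (e.g. in the cluster's Spark config), and the values below are illustrative, not tuned recommendations:

```python
from pyspark.sql import SparkSession

# Retention limits take effect only at session creation time.
# Defaults are 1000 each; lower values shrink driver-side UI state.
spark = (
    SparkSession.builder
    .config("spark.ui.retainedJobs", "100")
    .config("spark.ui.retainedStages", "100")
    .config("spark.sql.ui.retainedExecutions", "50")
    .getOrCreate()
)
```

On Databricks you would put these keys in the cluster's Spark config rather than building the session yourself.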
To mitigate this, batch the clones and reset driver state periodically. This is usually quite effective: chunk your clone list into batches (say 50–100 tables) and clear accumulated state between batches. If memory still grows, restarting the job (and with it the SparkSession) every few batches is the strongest reset:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def run_deep_clones(table_list, batch_size=50):
    for i in range(0, len(table_list), batch_size):
        batch = table_list[i : i + batch_size]
        for table in batch:
            spark.sql(
                f"CREATE OR REPLACE TABLE {table['target']} "
                f"DEEP CLONE {table['source']}"
            )
        # Force cleanup between batches
        spark.catalog.clearCache()
        spark._jvm.System.gc()  # Suggest a GC on the driver JVM
        print(f"Completed batch {i // batch_size + 1}, "
              f"{min(i + batch_size, len(table_list))}/{len(table_list)} tables done")
```
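For reference, `run_deep_clones` expects a list of dicts with `source` and `target` keys, and the batching arithmetic itself is plain Python, so you can sanity-check it without Spark. A minimal sketch (the `chunks` helper and table names are hypothetical):

```python
def chunks(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]

# 130 tables with batch_size=50 -> batches of 50, 50, 30
tables = [{"source": f"src.t{n}", "target": f"dst.t{n}"} for n in range(130)]
batches = chunks(tables, 50)
print(len(batches))      # 3
print(len(batches[-1]))  # 30
```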