<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Driver memory utilization grows continuously during job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154680#M54120</link>
    <description>&lt;P&gt;What you’re seeing (a monotonic, stair‑step climb in driver RAM over thousands of DEEP CLONE statements) is a very common pattern: the driver isn’t “holding data”, but metadata, query artifacts, and per‑command state that accumulate faster than the JVM can reclaim them.&lt;BR /&gt;Even a “pure SQL DDL/DML” workload can bloat the driver, because the driver is the control plane for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;parsing/analysis and query planning&lt;/LI&gt;&lt;LI&gt;tracking Spark SQL executions&lt;/LI&gt;&lt;LI&gt;holding session/catalog metadata&lt;/LI&gt;&lt;LI&gt;tracking job/stage/task events and SQL UI state&lt;/LI&gt;&lt;LI&gt;caching file indexes / transaction log snapshots&lt;/LI&gt;&lt;LI&gt;retaining objects through live references (listeners, accumulators, progress reporters)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Reduce concurrency&lt;/STRONG&gt;&lt;BR /&gt;Deep clones are heavy metadata-plus-file-movement operations. Running too many in parallel can overload the driver and run into S3/API listing limits. Try:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;cutting ForEach concurrency by 50–80%&lt;/LI&gt;&lt;LI&gt;aiming for a concurrency level at which driver memory flattens instead of rising&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;You’ll often get better overall throughput because you avoid driver GC thrash and retries.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Prefer SHALLOW CLONE + async copy when it meets your requirements&lt;/STRONG&gt;&lt;BR /&gt;If your use case is “replicate table structure quickly” and you don’t immediately need independent data copies, a shallow clone is dramatically cheaper: it copies metadata and references the source files rather than copying all the data. Then, later:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;convert to a deep copy only for the subset that truly needs it, or&lt;/LI&gt;&lt;LI&gt;use other replication patterns.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;If you’re deep cloning for migration, consider alternative approaches&lt;/STRONG&gt;&lt;BR /&gt;Depending on why you’re cloning (migration, environment refresh, etc.), alternatives can reduce driver overhead:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;CTAS (CREATE TABLE AS SELECT): heavier compute but sometimes more stable&lt;/LI&gt;&lt;LI&gt;incremental copy strategies (if the source is changing)&lt;/LI&gt;&lt;LI&gt;external replication tools/workflows&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Wed, 15 Apr 2026 18:49:27 GMT</pubDate>
    <dc:creator>nayan_wylde</dc:creator>
    <dc:date>2026-04-15T18:49:27Z</dc:date>
    <item>
      <title>Driver memory utilization grows continuously during job</title>
      <link>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154352#M54086</link>
      <description>&lt;P&gt;I have a batch job that runs thousands of Deep Clone commands; it uses a ForEach task to run multiple Deep Clones in parallel. It was taking a very long time, and I realized that the Driver was the main culprit, since it was using up all of its memory a few minutes into the job. I increased the driver size and used a node type with a lot more memory. That improved performance significantly, but the driver would still inevitably run out of memory and hit the same bottleneck, even now that it had 128GB of RAM.&lt;/P&gt;&lt;P&gt;You can see the incremental increase in memory utilization as the job progresses here:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="tsam_2-1776095245905.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/25946i23158E7ADE3DBD93/image-size/medium?v=v2&amp;amp;px=400" role="button" title="tsam_2-1776095245905.png" alt="tsam_2-1776095245905.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;By the end of the job, the Driver is using over 122GB of RAM, which seems excessive when all it's doing is running SQL Deep Clone commands without collecting any data.&lt;/P&gt;&lt;P&gt;What could cause so much bloat in this situation? And is there a way to avoid it from the start, or to catch and remedy it during the job?&lt;/P&gt;</description>
      <pubDate>Mon, 13 Apr 2026 15:58:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154352#M54086</guid>
      <dc:creator>tsam</dc:creator>
      <dc:date>2026-04-13T15:58:32Z</dc:date>
    </item>
    <item>
      <title>Re: Driver memory utilization grows continuously during job</title>
      <link>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154375#M54091</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226856"&gt;@tsam&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I think your problem is caused by the fact that each "CREATE OR REPLACE TABLE ... DEEP CLONE" call accumulates state on the driver even though you're not collecting data.&lt;/P&gt;&lt;P&gt;The main culprits are:&lt;/P&gt;&lt;P&gt;1. &lt;STRONG&gt;Spark Plan / Query Plan Caching&lt;/STRONG&gt; Every SQL command generates a logical and physical plan that Spark keeps in driver memory. With thousands of Deep Clone commands, these plans pile up and are never garbage collected during the job. Deep Clone plans are particularly heavy because they contain full table metadata, file listings, and schema information for both source and target.&lt;/P&gt;&lt;P&gt;2. &lt;STRONG&gt;Spark Listener Event Queue&lt;/STRONG&gt; The Spark UI event log and listeners accumulate SparkListenerEvent objects for every completed query - stage info, task metrics, SQL execution details. Thousands of clones mean thousands of events sitting in the driver's heap.&lt;/P&gt;&lt;P&gt;3. &lt;STRONG&gt;Delta Log State&lt;/STRONG&gt; Each Deep Clone reads the Delta transaction log of the source table. The driver holds onto DeltaLog snapshot objects, and Delta's internal log cache can grow very large across thousands of distinct tables.&amp;nbsp;&lt;/P&gt;&lt;P&gt;To mitigate this you can take the following approach:&amp;nbsp;&lt;STRONG&gt;batch the clones and clear accumulated driver state periodically.&lt;/STRONG&gt; This should be quite an effective approach: chunk your clone list into batches (say 50–100 tables) and clear state between batches:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession

# On Databricks the ambient `spark` session already exists;
# getOrCreate() returns it (and makes the snippet runnable elsewhere).
spark = SparkSession.builder.getOrCreate()

def run_deep_clones(table_list, batch_size=50):
    for i in range(0, len(table_list), batch_size):
        batch = table_list[i : i + batch_size]

        for table in batch:
            spark.sql(f"CREATE OR REPLACE TABLE {table['target']} DEEP CLONE {table['source']}")

        # Force cleanup between batches: drop cached data/plan state,
        # then nudge the JVM to collect what is now unreachable.
        spark.catalog.clearCache()
        spark._jvm.System.gc()  # a suggestion only; the JVM may ignore it

        print(f"Completed batch {i // batch_size + 1}, "
              f"{min(i + batch_size, len(table_list))}/{len(table_list)} tables done")&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 13 Apr 2026 19:16:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154375#M54091</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-04-13T19:16:36Z</dc:date>
    </item>
    <item>
      <title>Re: Driver memory utilization grows continuously during job</title>
      <link>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154647#M54119</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/226856"&gt;@tsam&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Can you share a few details:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Which DBR is the job on?&lt;/LI&gt;
&lt;LI&gt;How many DEEP CLONEs do you need to run in total?&lt;/LI&gt;
&lt;LI&gt;What is the parallelism of the for-each task?&lt;/LI&gt;
&lt;LI&gt;Are the cloned tables optimized (e.g. there is no "small file problem")?&lt;/LI&gt;
&lt;LI&gt;Can you share the Heap Histogram of the Driver (it can be found in the Spark UI)?&lt;/LI&gt;
&lt;/UL&gt;
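&lt;P&gt;If pulling the heap histogram out of the Spark UI is awkward, the same data can be captured with `jmap -histo &lt;driver-pid&gt;` (for example from the cluster's web terminal) and summarized offline. A minimal sketch, assuming saved `jmap -histo` text; the helper name `top_heap_classes` and the sample lines are illustrative, not from this thread:&lt;/P&gt;

```python
def top_heap_classes(histo_text, n=3):
    # jmap -histo rows look like: " num:  #instances  #bytes  class name"
    rows = []
    for line in histo_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0].rstrip(":").isdigit():
            rows.append((int(parts[2]), parts[3]))  # (bytes, class name)
    rows.sort(reverse=True)
    return rows[:n]

sample = """\
 num     #instances         #bytes  class name
----------------------------------------------
   1:        500000      120000000  [B
   2:        300000       48000000  java.lang.String
   3:         10000       32000000  org.apache.spark.sql.execution.ui.SQLExecutionUIData
"""
print(top_heap_classes(sample, n=2))
```

&lt;P&gt;The classes dominating `#bytes` usually point straight at the retainer (UI/SQL execution state, Delta snapshots, listener events).&lt;/P&gt;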
&lt;P&gt;In parallel, a simple fix that I can suggest is to run it on the most recent DBR version.&lt;/P&gt;
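&lt;P&gt;On the for-each parallelism question above: it can help to reproduce the fan-out in a single notebook with an explicit cap, then tune the cap down until driver memory flattens. A minimal sketch; `clone_one` is a hypothetical stand-in for the actual `spark.sql(... DEEP CLONE ...)` call and the table names are made up:&lt;/P&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def clone_one(pair):
    # Hypothetical stand-in for:
    #   spark.sql(f"CREATE OR REPLACE TABLE {pair['target']} DEEP CLONE {pair['source']}")
    return f"cloned {pair['source']} -> {pair['target']}"

def run_with_cap(pairs, max_workers=4):
    # Hard cap on in-flight clones, independent of list length.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clone_one, pairs))

results = run_with_cap(
    [{"source": "src.t1", "target": "dst.t1"},
     {"source": "src.t2", "target": "dst.t2"}],
    max_workers=2,
)
```

&lt;P&gt;The design point is that `max_workers` becomes a single knob you can lower between runs while watching the driver memory chart.&lt;/P&gt;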
&lt;P&gt;Best regards,&lt;/P&gt;</description>
      <pubDate>Wed, 15 Apr 2026 12:45:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154647#M54119</guid>
      <dc:creator>aleksandra_ch</dc:creator>
      <dc:date>2026-04-15T12:45:01Z</dc:date>
    </item>
    <item>
      <title>Re: Driver memory utilization grows continuously during job</title>
      <link>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154680#M54120</link>
      <description>&lt;P&gt;What you’re seeing (a monotonic, stair‑step climb in driver RAM over thousands of DEEP CLONE statements) is a very common pattern: the driver isn’t “holding data”, but metadata, query artifacts, and per‑command state that accumulate faster than the JVM can reclaim them.&lt;BR /&gt;Even a “pure SQL DDL/DML” workload can bloat the driver, because the driver is the control plane for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;parsing/analysis and query planning&lt;/LI&gt;&lt;LI&gt;tracking Spark SQL executions&lt;/LI&gt;&lt;LI&gt;holding session/catalog metadata&lt;/LI&gt;&lt;LI&gt;tracking job/stage/task events and SQL UI state&lt;/LI&gt;&lt;LI&gt;caching file indexes / transaction log snapshots&lt;/LI&gt;&lt;LI&gt;retaining objects through live references (listeners, accumulators, progress reporters)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Reduce concurrency&lt;/STRONG&gt;&lt;BR /&gt;Deep clones are heavy metadata-plus-file-movement operations. Running too many in parallel can overload the driver and run into S3/API listing limits. Try:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;cutting ForEach concurrency by 50–80%&lt;/LI&gt;&lt;LI&gt;aiming for a concurrency level at which driver memory flattens instead of rising&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;You’ll often get better overall throughput because you avoid driver GC thrash and retries.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Prefer SHALLOW CLONE + async copy when it meets your requirements&lt;/STRONG&gt;&lt;BR /&gt;If your use case is “replicate table structure quickly” and you don’t immediately need independent data copies, a shallow clone is dramatically cheaper: it copies metadata and references the source files rather than copying all the data. Then, later:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;convert to a deep copy only for the subset that truly needs it, or&lt;/LI&gt;&lt;LI&gt;use other replication patterns.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;If you’re deep cloning for migration, consider alternative approaches&lt;/STRONG&gt;&lt;BR /&gt;Depending on why you’re cloning (migration, environment refresh, etc.), alternatives can reduce driver overhead:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;CTAS (CREATE TABLE AS SELECT): heavier compute but sometimes more stable&lt;/LI&gt;&lt;LI&gt;incremental copy strategies (if the source is changing)&lt;/LI&gt;&lt;LI&gt;external replication tools/workflows&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 15 Apr 2026 18:49:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/driver-memory-utilization-grows-continuously-during-job/m-p/154680#M54120</guid>
      <dc:creator>nayan_wylde</dc:creator>
      <dc:date>2026-04-15T18:49:27Z</dc:date>
    </item>
  </channel>
</rss>

