<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cluster crashes occasionally but not all of the time in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143900#M52234</link>
    <description>&lt;P&gt;We have a small cluster (Standard D2ads v6) with 8 GB of RAM and 2 cores. This is an all-purpose cluster, and for some reason the client insists on using it for our ETL process. The process is simple: the client drops parquet files into blob storage, and a Databricks job scheduled every day reads the files from the blob, saves the content into a &lt;STRONG&gt;&lt;EM&gt;hive_metastore&lt;/EM&gt; &lt;/STRONG&gt;table, and moves the parquet files from the blob to an Archive location.&lt;/P&gt;&lt;P&gt;Currently the biggest table we have has 66 million rows, and it is enriched every day. In total we have 7 tables, but recently an issue started popping up. Occasionally, which is weird, the pipeline fails even though we receive a similar amount of data each day. For example, today it might fail, but tomorrow it might finish quickly and without any issues. The failure message is:&amp;nbsp;&lt;EM&gt;Run failed with error message; Could not reach driver of cluster xxx-xxxxx-xxxx&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The Metrics tab shows 100% memory utilization and nearly 100% CPU. My code is mostly Spark code, except for a few places where I use `.collect()`, but only on a small table (7 rows). What confuses me is: if the compute has memory/performance constraints, why does the job fail only occasionally and not every time? I tried to reduce memory pressure by clearing the cache, but I still get failures from time to time.&lt;/P&gt;&lt;P&gt;Also worth mentioning: the compute is used only by this job, so no other computations run on it.&lt;/P&gt;</description>
    <pubDate>Tue, 13 Jan 2026 14:39:13 GMT</pubDate>
    <dc:creator>NotCuriosAtAll</dc:creator>
    <dc:date>2026-01-13T14:39:13Z</dc:date>
    <item>
      <title>Cluster crashes occasionally but not all of the time</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143900#M52234</link>
      <description>&lt;P&gt;We have a small cluster (Standard D2ads v6) with 8 GB of RAM and 2 cores. This is an all-purpose cluster, and for some reason the client insists on using it for our ETL process. The process is simple: the client drops parquet files into blob storage, and a Databricks job scheduled every day reads the files from the blob, saves the content into a &lt;STRONG&gt;&lt;EM&gt;hive_metastore&lt;/EM&gt; &lt;/STRONG&gt;table, and moves the parquet files from the blob to an Archive location.&lt;/P&gt;&lt;P&gt;Currently the biggest table we have has 66 million rows, and it is enriched every day. In total we have 7 tables, but recently an issue started popping up. Occasionally, which is weird, the pipeline fails even though we receive a similar amount of data each day. For example, today it might fail, but tomorrow it might finish quickly and without any issues. The failure message is:&amp;nbsp;&lt;EM&gt;Run failed with error message; Could not reach driver of cluster xxx-xxxxx-xxxx&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The Metrics tab shows 100% memory utilization and nearly 100% CPU. My code is mostly Spark code, except for a few places where I use `.collect()`, but only on a small table (7 rows). What confuses me is: if the compute has memory/performance constraints, why does the job fail only occasionally and not every time? I tried to reduce memory pressure by clearing the cache, but I still get failures from time to time.&lt;/P&gt;&lt;P&gt;Also worth mentioning: the compute is used only by this job, so no other computations run on it.&lt;/P&gt;</description>
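The daily flow described above (read new parquet files from blob, append into the metastore table, then move the files to Archive) can be sketched roughly as below. All paths, the table name, and the `dbutils` file moves are illustrative assumptions, not the poster's actual job code:

```python
# Hedged sketch of the daily ETL described in the post.
# The container/account names, table name, and use of dbutils.fs are assumptions.

def archive_destination(src_path: str, landing_root: str, archive_root: str) -> str:
    """Map a file under the landing root to the same relative path under Archive."""
    if not src_path.startswith(landing_root):
        raise ValueError(f"{src_path} is not under {landing_root}")
    return archive_root + src_path[len(landing_root):]

def run_etl(spark, dbutils,
            landing_root="abfss://container@account.dfs.core.windows.net/landing/",
            archive_root="abfss://container@account.dfs.core.windows.net/archive/",
            table="hive_metastore.default.events"):
    # Read everything currently sitting in the landing area.
    df = spark.read.parquet(landing_root)
    # Append into the metastore table; the data never needs to reach the driver.
    df.write.mode("append").saveAsTable(table)
    # Move each processed file to the Archive location.
    for f in dbutils.fs.ls(landing_root):
        dbutils.fs.mv(f.path, archive_destination(f.path, landing_root, archive_root))
```

`run_etl` expects the `spark` and `dbutils` handles that Databricks provides in a notebook or job context; only the pure path-mapping helper runs anywhere.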
      <pubDate>Tue, 13 Jan 2026 14:39:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143900#M52234</guid>
      <dc:creator>NotCuriosAtAll</dc:creator>
      <dc:date>2026-01-13T14:39:13Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster crashes occasionally but not all of the time</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143922#M52238</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/207558"&gt;@NotCuriosAtAll&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you try the following?&lt;/P&gt;&lt;TABLE width="503"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="91"&gt;Issue&lt;/TD&gt;&lt;TD width="180"&gt;Fix&lt;/TD&gt;&lt;TD width="232"&gt;Reference Links&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;Driver undersized&lt;/TD&gt;&lt;TD width="180"&gt;Request a larger driver (e.g. i3.xlarge, roughly 16 GB / 4 cores); run as a single-node job cluster&lt;/TD&gt;&lt;TD width="232"&gt;80% reliability boost - &lt;A href="https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/td-p/62164" target="_blank"&gt;Could not reach driver of cluster (community thread)&lt;/A&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;All-purpose shared cluster&lt;/TD&gt;&lt;TD width="180"&gt;Switch to a job cluster that terminates after the run&lt;/TD&gt;&lt;TD width="232"&gt;&lt;A href="https://kb.databricks.com/jobs/driver-unavailable" target="_blank"&gt;No state build-up between runs&lt;/A&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;Hive commits&lt;/TD&gt;&lt;TD width="180"&gt;Batch daily loads into Delta; run OPTIMIZE weekly&lt;/TD&gt;&lt;TD width="232"&gt;&lt;A href="https://www.linkedin.com/posts/baljeetjangra_processing-2billion-rows-efficiently-in-activity-7384803148332396544-6Y_w" target="_blank"&gt;50% faster appends&lt;/A&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;Monitoring&lt;/TD&gt;&lt;TD width="180"&gt;Set job alerts on driver metrics; autoscale with a minimum of 2 workers&lt;/TD&gt;&lt;TD width="232"&gt;Catch 100% utilization spikes early&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
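The "switch to a job cluster" row above can be made concrete as a cluster spec in the shape used by the Databricks Jobs API. The runtime version string, Azure node type, and worker counts here are illustrative assumptions, not recommendations:

```python
# Hedged sketch: job-cluster specs in the shape accepted by the Databricks Jobs API.
# spark_version, node_type_id, and sizes are illustrative placeholders.

# Fixed-size job cluster; it is created per run and terminates afterwards,
# so no driver state accumulates between runs.
job_cluster_spec = {
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",   # assumed LTS runtime string
        "node_type_id": "Standard_D4ads_v5",   # assumed Azure type with more headroom than D2
        "num_workers": 2,
    }
}

# Autoscaling variant, matching the "auto-scale min 2 workers" suggestion above.
autoscaling_variant = {
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_D4ads_v5",
        "autoscale": {"min_workers": 2, "max_workers": 4},
    }
}
```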
      <pubDate>Tue, 13 Jan 2026 17:43:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143922#M52238</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2026-01-13T17:43:18Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster crashes occasionally but not all of the time</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143944#M52241</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/207558"&gt;@NotCuriosAtAll&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Your cluster is undersized for this workload.&amp;nbsp;This error is typical when the driver node runs at such high CPU consumption. You can check the article below (and the related solution):&lt;BR /&gt;&lt;A href="https://kb.databricks.com/clusters/job-run-fails-with-error-message-could-not-reach-driver-of-cluster" target="_blank" rel="noopener"&gt;Job run fails with error message “Could not reach driver of cluster” - Databricks&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;If I were you, I would simply increase your compute. Your job sometimes succeeds because the amount of data is a bit different each day. But if you see nearly 100% CPU and memory consumption every day, then your workload certainly demands bigger compute (or optimization of your code).&lt;/P&gt;</description>
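On the code-optimization side, one cheap guard worth adding before resizing: wrap any `.collect()` in a bounded row-count check, so that an unexpectedly large table fails loudly instead of silently exhausting the driver. A generic sketch, not the poster's actual code:

```python
def collect_if_small(df, max_rows: int = 1000):
    """Collect a DataFrame to the driver only when it is provably small.

    Counts at most max_rows + 1 rows (the limit bounds the work), and raises
    instead of collecting if the table turns out larger than expected.
    """
    n = df.limit(max_rows + 1).count()
    if n > max_rows:
        raise RuntimeError(f"Refusing to collect: more than {max_rows} rows")
    return df.collect()
```

With the poster's 7-row lookup table this behaves exactly like a plain `.collect()`; it only changes behavior on the day a "small" table is no longer small.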
      <pubDate>Tue, 13 Jan 2026 20:57:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143944#M52241</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-01-13T20:57:23Z</dc:date>
    </item>
  </channel>
</rss>

