<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cluster crashes occasionally but not all of the time in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143900#M52234</link>
    <description>&lt;P&gt;We have a small cluster (Standard D2ads v6) with 8 GB of RAM and 2 cores. This is an all-purpose cluster, and for some reason the client insists on using it for our ETL process. The process is simple: the client drops parquet files into blob storage, and a Databricks job scheduled every day reads the files from the blob, saves the content into a &lt;STRONG&gt;&lt;EM&gt;hive_metastore&lt;/EM&gt; &lt;/STRONG&gt;table, and moves the parquet files from the blob to an Archive location.&lt;/P&gt;&lt;P&gt;Currently the biggest table we have has 66 million rows, and it is enriched every day. In total we have 7 tables, but recently an issue started popping up. Occasionally, which is weird, the pipeline fails even though we receive a similar amount of data each day. For example, today it might fail, but tomorrow it might finish quickly and without any issues. The failure message is:&amp;nbsp;&lt;EM&gt;Run failed with error message; Could not reach driver of cluster xxx-xxxxx-xxxx&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The Metrics tab shows 100% memory utilization and nearly 100% CPU. My code is mostly Spark code, except for a few places where I use `.collect()`, but only on a small table (7 rows). What confuses me is: if the compute has memory/performance constraints, why does the job fail only occasionally and not every time? I tried to reduce memory pressure by clearing the cache, but I still get failures from time to time.&lt;/P&gt;&lt;P&gt;Also worth mentioning: the compute is used only by this job, so no other computations run on it.&lt;/P&gt;</description>
    <pubDate>Tue, 13 Jan 2026 14:39:13 GMT</pubDate>
    <dc:creator>NotCuriosAtAll</dc:creator>
    <dc:date>2026-01-13T14:39:13Z</dc:date>
    <item>
      <title>Cluster crashes occasionally but not all of the time</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143900#M52234</link>
      <description>&lt;P&gt;We have a small cluster (Standard D2ads v6) with 8 GB of RAM and 2 cores. This is an all-purpose cluster, and for some reason the client insists on using it for our ETL process. The process is simple: the client drops parquet files into blob storage, and a Databricks job scheduled every day reads the files from the blob, saves the content into a &lt;STRONG&gt;&lt;EM&gt;hive_metastore&lt;/EM&gt; &lt;/STRONG&gt;table, and moves the parquet files from the blob to an Archive location.&lt;/P&gt;&lt;P&gt;Currently the biggest table we have has 66 million rows, and it is enriched every day. In total we have 7 tables, but recently an issue started popping up. Occasionally, which is weird, the pipeline fails even though we receive a similar amount of data each day. For example, today it might fail, but tomorrow it might finish quickly and without any issues. The failure message is:&amp;nbsp;&lt;EM&gt;Run failed with error message; Could not reach driver of cluster xxx-xxxxx-xxxx&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The Metrics tab shows 100% memory utilization and nearly 100% CPU. My code is mostly Spark code, except for a few places where I use `.collect()`, but only on a small table (7 rows). What confuses me is: if the compute has memory/performance constraints, why does the job fail only occasionally and not every time? I tried to reduce memory pressure by clearing the cache, but I still get failures from time to time.&lt;/P&gt;&lt;P&gt;Also worth mentioning: the compute is used only by this job, so no other computations run on it.&lt;/P&gt;</description>
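The daily flow described above (read new parquet files from blob, append into the metastore table, then move the files to Archive) can be sketched roughly as below. All paths, the table name, and the `dbutils` file moves are illustrative assumptions, not the poster's actual job code:

```python
# Hedged sketch of the daily ETL described in the post.
# The container/account names, table name, and use of dbutils.fs are assumptions.

def archive_destination(src_path: str, landing_root: str, archive_root: str) -> str:
    """Map a file under the landing root to the same relative path under Archive."""
    if not src_path.startswith(landing_root):
        raise ValueError(f"{src_path} is not under {landing_root}")
    return archive_root + src_path[len(landing_root):]

def run_etl(spark, dbutils,
            landing_root="abfss://container@account.dfs.core.windows.net/landing/",
            archive_root="abfss://container@account.dfs.core.windows.net/archive/",
            table="hive_metastore.default.events"):
    # Read everything currently sitting in the landing area.
    df = spark.read.parquet(landing_root)
    # Append into the metastore table; the data never needs to reach the driver.
    df.write.mode("append").saveAsTable(table)
    # Move each processed file to the Archive location.
    for f in dbutils.fs.ls(landing_root):
        dbutils.fs.mv(f.path, archive_destination(f.path, landing_root, archive_root))
```

`run_etl` expects the `spark` and `dbutils` handles that Databricks provides in a notebook or job context; only the pure path-mapping helper runs anywhere.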
      <pubDate>Tue, 13 Jan 2026 14:39:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143900#M52234</guid>
      <dc:creator>NotCuriosAtAll</dc:creator>
      <dc:date>2026-01-13T14:39:13Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster crashes occasionally but not all of the time</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143922#M52238</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/207558"&gt;@NotCuriosAtAll&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can you try the following?&lt;/P&gt;&lt;TABLE width="503"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="91"&gt;Issue&lt;/TD&gt;&lt;TD width="180"&gt;Fix&lt;/TD&gt;&lt;TD width="232"&gt;Reference Links&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;Driver undersized&lt;/TD&gt;&lt;TD width="180"&gt;Request a larger driver (e.g. i3.xlarge, roughly 16 GB / 4 cores); run as a single-node job cluster&lt;/TD&gt;&lt;TD width="232"&gt;80% reliability boost - &lt;A href="https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/td-p/62164" target="_blank"&gt;Could not reach driver of cluster (community thread)&lt;/A&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;All-purpose shared cluster&lt;/TD&gt;&lt;TD width="180"&gt;Switch to a job cluster that terminates after the run&lt;/TD&gt;&lt;TD width="232"&gt;&lt;A href="https://kb.databricks.com/jobs/driver-unavailable" target="_blank"&gt;No state build-up between runs&lt;/A&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;Hive commits&lt;/TD&gt;&lt;TD width="180"&gt;Batch daily loads into Delta; run OPTIMIZE weekly&lt;/TD&gt;&lt;TD width="232"&gt;&lt;A href="https://www.linkedin.com/posts/baljeetjangra_processing-2billion-rows-efficiently-in-activity-7384803148332396544-6Y_w" target="_blank"&gt;50% faster appends&lt;/A&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="91"&gt;Monitoring&lt;/TD&gt;&lt;TD width="180"&gt;Set job alerts on driver metrics; autoscale with a minimum of 2 workers&lt;/TD&gt;&lt;TD width="232"&gt;Catch 100% utilization spikes early&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
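The "switch to a job cluster" row above can be made concrete as a cluster spec in the shape used by the Databricks Jobs API. The runtime version string, Azure node type, and worker counts here are illustrative assumptions, not recommendations:

```python
# Hedged sketch: job-cluster specs in the shape accepted by the Databricks Jobs API.
# spark_version, node_type_id, and sizes are illustrative placeholders.

# Fixed-size job cluster; it is created per run and terminates afterwards,
# so no driver state accumulates between runs.
job_cluster_spec = {
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",   # assumed LTS runtime string
        "node_type_id": "Standard_D4ads_v5",   # assumed Azure type with more headroom than D2
        "num_workers": 2,
    }
}

# Autoscaling variant, matching the "auto-scale min 2 workers" suggestion above.
autoscaling_variant = {
    "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "Standard_D4ads_v5",
        "autoscale": {"min_workers": 2, "max_workers": 4},
    }
}
```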
      <pubDate>Tue, 13 Jan 2026 17:43:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143922#M52238</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2026-01-13T17:43:18Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster crashes occasionally but not all of the time</title>
      <link>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143944#M52241</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/207558"&gt;@NotCuriosAtAll&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Your cluster is undersized for this workload.&amp;nbsp;This error is typical when the driver node runs at such high CPU consumption. You can check the article below (and the related solution):&lt;BR /&gt;&lt;A href="https://kb.databricks.com/clusters/job-run-fails-with-error-message-could-not-reach-driver-of-cluster" target="_blank" rel="noopener"&gt;Job run fails with error message “Could not reach driver of cluster” - Databricks&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;If I were you, I would simply increase your compute. Your job sometimes succeeds because the amount of data is a bit different each day. But if you see nearly 100% CPU and memory consumption every day, then your workload certainly demands bigger compute (or optimization of your code).&lt;/P&gt;</description>
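On the code-optimization side, one cheap guard worth adding before resizing: wrap any `.collect()` in a bounded row-count check, so that an unexpectedly large table fails loudly instead of silently exhausting the driver. A generic sketch, not the poster's actual code:

```python
def collect_if_small(df, max_rows: int = 1000):
    """Collect a DataFrame to the driver only when it is provably small.

    Counts at most max_rows + 1 rows (the limit bounds the work), and raises
    instead of collecting if the table turns out larger than expected.
    """
    n = df.limit(max_rows + 1).count()
    if n > max_rows:
        raise RuntimeError(f"Refusing to collect: more than {max_rows} rows")
    return df.collect()
```

With the poster's 7-row lookup table this behaves exactly like a plain `.collect()`; it only changes behavior on the day a "small" table is no longer small.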
      <pubDate>Tue, 13 Jan 2026 20:57:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cluster-crashes-occasionally-but-not-all-of-the-time/m-p/143944#M52241</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-01-13T20:57:23Z</dc:date>
    </item>
  </channel>
</rss>

