
Cluster crashes occasionally but not all of the time

NotCuriosAtAll
New Contributor

We have a small cluster (Standard D2ads v6) with 8 GB of RAM and 2 cores. It's an all-purpose cluster, and for some reason the client insists on using it for our ETL process. The process is simple: the client drops parquet files into blob storage, and a Databricks job scheduled every day reads the files from the blob, saves the content into a hive_metastore table, and moves the parquet files from the blob to an Archive location.
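
For context, a minimal PySpark sketch of a daily load like the one described, assuming hypothetical storage paths and table name (the post doesn't show the real ones; `spark` and `dbutils` are predefined on Databricks):

```python
# Sketch of the described daily load; paths and table name are hypothetical.
landing = "wasbs://data@<account>.blob.core.windows.net/incoming/"  # assumed landing path
archive = "wasbs://data@<account>.blob.core.windows.net/archive/"   # assumed archive path

# Read the day's parquet drop and append it to the metastore table.
df = spark.read.parquet(landing)
df.write.mode("append").saveAsTable("hive_metastore.default.etl_table")  # assumed name

# Move the processed files to the Archive location.
for f in dbutils.fs.ls(landing):
    dbutils.fs.mv(f.path, archive + f.name)
```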

Currently, our biggest table has 66 million rows and it's being enriched every day. In total we have 7 tables, but recently an issue started popping up. Occasionally, which is the weird part, the pipeline fails even though we receive a similar amount of data on a daily basis. For example, it might fail today but finish tomorrow without any issues, and pretty fast. The failure message is: Run failed with error message: Could not reach driver of cluster xxx-xxxxx-xxxx

The Metrics tab shows 100% memory utilization and nearly 100% CPU utilization. My code is mostly Spark code, except for a few places where I use `.collect()`, but only on a small table (7 rows). What confuses me is: if the compute has a memory/performance constraint, why does it fail only occasionally and not every time? I tried to reduce memory pressure by clearing the cache, but I still get failures from time to time.
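
For reference, the cache clearing mentioned above can be done like this (a sketch; it only helps if something was actually cached):

```python
# Drop all cached tables/DataFrames on this cluster to free memory.
spark.catalog.clearCache()

# Or release one specific DataFrame that was persisted earlier:
# df.unpersist()
```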

Also worth mentioning: the compute is used only by this job, so there are no other computations running on it.

1 ACCEPTED SOLUTION

szymon_dybczak
Esteemed Contributor III

Hi @NotCuriosAtAll ,

Your cluster is undersized for your workload. This error is typical when the driver node is under that much CPU pressure. You can check the article below (and the related solution):
Job run fails with error message “Could not reach driver of cluster” - Databricks

If I were you, I would just increase your compute. Your job sometimes works because the amount of data is a bit different each day. But if you see nearly 100% CPU and memory consumption every day, then your workload definitely demands bigger compute (or an optimization of your code).
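
For illustration only, here is a hedged sketch of what a larger job-cluster definition could look like (a Jobs API `new_cluster` block expressed as a Python dict; the node type and runtime version are assumed examples for Azure, not a recommendation from this thread):

```python
# Sketch of a larger job cluster spec; values are assumptions.
new_cluster = {
    "spark_version": "15.4.x-scala2.12",  # an LTS runtime; match what you run today
    "node_type_id": "Standard_D8ads_v5",  # 8 cores / 32 GB, vs. D2ads' 2 cores / 8 GB
    "num_workers": 2,
}
```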


2 REPLIES

ManojkMohan
Honored Contributor II

@NotCuriosAtAll 

Can you try the following?

- Driver undersized: request a driver with more resources (roughly a 16 GB / 4 core equivalent), e.g. as a single-node job cluster. Reference: https://community.databricks.com/t5/data-engineering/could-not-reach-driver-of-cluster/td-p/62164
- All-purpose shared cluster: switch to a dedicated job cluster that terminates after the run, so no state builds up between runs.
- Hive commits: batch the daily load into a Delta table and OPTIMIZE weekly for faster appends (a sketch follows this list).
- Monitoring: set job alerts on driver metrics and enable auto-scaling with a minimum of 2 workers, so 100% utilization spikes are caught early.
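
A minimal sketch of the "batch into Delta, OPTIMIZE weekly" pattern, assuming a hypothetical table name:

```python
# Append the daily batch to a Delta table; table name is hypothetical.
(df.write
   .format("delta")
   .mode("append")
   .saveAsTable("hive_metastore.default.etl_table"))

# Run periodically (e.g., a weekly scheduled job) to compact small files:
spark.sql("OPTIMIZE hive_metastore.default.etl_table")
```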
