Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Concurrent Jobs - The spark driver has stopped unexpectedly!

uzairm
New Contributor III

Hi, I am running concurrent notebooks in concurrent workflow jobs on a job compute cluster (c5a.8xlarge) with 5-7 worker nodes. Each job runs 100 concurrent child notebooks, and there are 10 job instances. 8 out of 10 jobs fail with the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

How can I resolve that?

1 ACCEPTED SOLUTION

uzairm
New Contributor III

I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...


12 REPLIES

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

Check Ganglia for performance-related issues (maybe it's running out of memory?).

uzairm
New Contributor III

Hi,

I have been checking Ganglia. About 300 GB of free space is available.

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

It's hard to answer without digging into the logs and code.

uzairm
New Contributor III

I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

So basically, that's what I expected, OOM 🙂

It's good that you were able to find the issue.

Anonymous
Not applicable

Hi @uzair mustafa​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

uzairm
New Contributor III
Hi Vidula,
I am still facing the issue; it has not been resolved. It would be great if someone could help me.

Did you solve this issue? I'm in a similar situation.

uzairm
New Contributor III

@Jeetash Kumar I identified the issue: the driver memory was getting exhausted. I fine-tuned my code so that fewer operations are done on the driver side, and I reduced the concurrency of my tasks. This answer is based on my use case.
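For anyone wondering what reducing the concurrency could look like, here is a minimal sketch, not the exact code from this thread. It assumes the child notebooks are launched from a parent notebook through a thread pool. The helper name `run_bounded`, the `child_paths` list and the limit of 8 are made up for illustration. In a Databricks notebook, `run_one` could wrap `dbutils.notebook.run(path, timeout_seconds, arguments)`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_bounded(paths, run_one, max_workers=8):
    """Call run_one(path) for every path with at most max_workers in flight.

    Returns (results, errors), two dicts keyed by path.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Every task is submitted up front, but the pool only runs
        # max_workers of them at any one time.
        futures = {pool.submit(run_one, p): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                results[path] = fut.result()
            except Exception as exc:
                # Record the failure and keep going with the other notebooks.
                errors[path] = exc
    return results, errors


# Hypothetical usage inside a Databricks notebook (dbutils only exists there):
#   run_one = lambda p: dbutils.notebook.run(p, 3600, {})
#   results, errors = run_bounded(child_paths, run_one, max_workers=8)
```

A lower `max_workers` means fewer notebook runs competing for driver memory at once, at the cost of a longer total runtime. The right value depends on the workload, so it is worth watching the driver while tuning it.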

uzairm
New Contributor III

You can check your driver memory in the Ganglia UI; monitor it as your cluster runs.

Yes, I'm monitoring driver memory in Ganglia (screenshot of the driver node attached).

Which operations run on the driver side that I need to avoid?

uzairm
New Contributor III
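Operations like collect() bring all of the results onto the driver node. Avoid calling them on large DataFrames. select() and aggregations themselves run on the workers, but collecting their output afterwards still consumes driver memory.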
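A small PySpark sketch of the difference, under some assumptions: `df` is a Spark DataFrame, and the column names and output path in the usage comment are hypothetical. As far as I know, it is calls that return rows to the notebook, such as collect(), that load the driver. A groupBy/agg followed by a write stays on the workers.

```python
def aggregate_and_write(df, key_col, value_col, out_path):
    """Sum value_col per key_col and write the result as Parquet.

    The aggregation and the write both run on the executors, so no rows
    are returned to the driver.
    """
    totals = df.groupBy(key_col).agg({value_col: "sum"})
    totals.write.mode("overwrite").parquet(out_path)


def preview(df, n=20):
    """Bring at most n rows to the driver instead of an unbounded collect()."""
    return df.limit(n).collect()


# Hypothetical usage in a notebook where `spark` is already defined:
#   df = spark.read.parquet("/path/to/input")
#   aggregate_and_write(df, "job_id", "bytes", "/path/to/output")
#   rows = preview(df, n=10)
```

If some rows really are needed on the driver, capping them with limit() first keeps the memory cost predictable.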
