Concurrent Jobs - The spark driver has stopped unexpectedly!

uzairm
New Contributor III

Hi, I am running concurrent notebooks in concurrent workflow jobs on a job compute cluster (c5a.8xlarge) with 5-7 worker nodes. Each job runs 100 concurrent child notebooks, and there are 10 job instances. 8 out of 10 jobs fail with the error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

How can I resolve that?

1 ACCEPTED SOLUTION

Accepted Solutions

uzairm
New Contributor III

I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...



daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

Check Ganglia for performance-related issues (maybe it's hitting OOM?).

uzairm
New Contributor III

Hi,

I have been checking Ganglia. About 300 GB of free space is available.

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

It's hard to answer without digging into the logs and code.

uzairm
New Contributor III

I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

So basically, that's what I expected: OOM 🙂

It's good that you were able to find the issue.

Anonymous
Not applicable

Hi @uzair mustafa​ 

Hope everything is going great.

Just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!

uzairm
New Contributor III
Hi Vidula,
I still face the issue and it has not been resolved. It would be great if someone could help me.

JKR
New Contributor III

Did you solve this issue? I'm in a similar situation.

uzairm
New Contributor III

@Jeetash Kumar​ I identified the issue: the driver memory was getting exhausted. I fine-tuned my code so that fewer operations are performed on the driver side, and I reduced the concurrency of my tasks. This answer is based on my use case.
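Editor's note: one common way to reduce task concurrency, as described above, is to cap how many child notebooks run at once rather than launching all 100 simultaneously. A minimal sketch using Python's standard library; `run_notebook` is a hypothetical stand-in for whatever launches a child notebook (on Databricks that would typically be `dbutils.notebook.run`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the real notebook launcher
# (e.g. dbutils.notebook.run(path, timeout, args) on Databricks).
def run_notebook(path: str) -> str:
    return f"done:{path}"

def run_all(paths, max_workers=8):
    """Run child notebooks with at most `max_workers` in flight.

    Capping concurrency bounds the number of driver-side threads and
    in-flight result objects, instead of holding 100 at once.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_notebook, p): p for p in paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

notebook_paths = [f"/jobs/child_{i}" for i in range(100)]
```

Tuning `max_workers` trades total wall-clock time against peak driver memory; the right value depends on how heavy each child notebook is.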

uzairm
New Contributor III

You can look at your driver memory in the Ganglia UI; monitor it as your cluster runs.
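Editor's note: Ganglia is the right tool for continuous monitoring, but for a quick in-notebook spot check you can also ask the driver process itself for its peak memory use. A small sketch assuming a Unix driver (standard library only; note the `ru_maxrss` unit is kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of the current (driver) process, in MB.

    On Linux, ru_maxrss is reported in kilobytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
```

Printing this before and after a suspect cell helps attribute driver-memory growth to a specific operation.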

JKR
New Contributor III

Yes, I'm monitoring driver memory in Ganglia (attaching a screenshot of the driver node).

What might be the operations done on the driver side that I need to avoid?

uzairm
New Contributor III
Operations like collect() and toPandas() pull the full result set onto the driver node, and aggregation results also land on the driver once you collect them. Avoid those where possible.
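Editor's note: the driver-memory difference between collecting everything and consuming rows incrementally (as `DataFrame.toLocalIterator()` does in Spark) can be illustrated with a pure-Python analogy, no Spark required:

```python
def rows(n):
    """Lazily yield n 'rows' one at a time (analogy for a distributed result)."""
    for i in range(n):
        yield i

N = 100_000

# collect()-style: every row is materialized in driver memory at once.
collected = list(rows(N))

# iterator-style (like toLocalIterator()): only one row is in flight,
# so driver memory stays constant regardless of N.
streamed_total = sum(rows(N))
```

The same principle applies in Spark: prefer writing results out with distributed sinks, or iterate with `toLocalIterator()`, instead of calling `collect()` on large DataFrames.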