02-27-2023 12:29 PM
Hi, I am running concurrent notebooks from concurrent workflow jobs on a job compute cluster (c5a.8xlarge with 5-7 worker nodes). Each job runs 100 concurrent child notebooks, and there are 10 job instances. 8 out of 10 jobs fail with the error: "The Spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."
How can I resolve this?
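For reference, the child notebooks are fanned out from each job's driver roughly like this (a simplified sketch, not the exact code; the notebook path and parameters are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder child notebook path and per-run timeout.
CHILD_NOTEBOOK = "/Workspace/jobs/child_task"
TIMEOUT_SECONDS = 3600

def run_child(i):
    # dbutils is the Databricks notebook-scoped utility object; each call
    # blocks a driver thread until the child notebook finishes.
    return dbutils.notebook.run(CHILD_NOTEBOOK, TIMEOUT_SECONDS, {"task_id": str(i)})

# 100 concurrent children per job: every in-flight run holds state on that job's driver.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(run_child, range(100)))
```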
02-27-2023 10:22 PM
@uzair mustafa
Check Ganglia for performance-related issues (maybe it's hitting an OOM?).
02-27-2023 10:49 PM
Hi,
I have been checking Ganglia; about 300 GB of free space is still available.
02-28-2023 10:41 PM
@uzair mustafa
It's hard to answer without digging into the logs and code.
02-28-2023 11:12 PM
I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...
02-28-2023 11:34 PM
@uzair mustafa
So basically, that's what I expected, OOM 🙂
It's good that you were able to find the issue.
03-12-2023 09:53 PM
Hi @uzair mustafa
Hope everything is going great.
Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please let us know so we can help you.
Cheers!
04-26-2023 01:10 AM
Did you solve this issue? I'm in a similar situation.
04-26-2023 03:24 AM
@Jeetash Kumar I identified the issue: the driver memory was getting exhausted. I fine-tuned my code so that fewer operations are done on the driver side, and I reduced the concurrency of my tasks. This answer is based on my use case; a sketch of the kind of changes is below.
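For illustration, here is the sort of driver-side pattern I moved away from, with distributed alternatives (a rough sketch with placeholder table and column names, not the exact code from my job; the concurrency change was simply lowering the cap on how many child notebooks run at once):

```python
# `spark` is the notebook-scoped SparkSession; table names are placeholders.
df = spark.table("source_table")

# Patterns that pull the full dataset into driver memory (can exhaust the driver):
rows = df.collect()   # materializes every row on the driver
pdf = df.toPandas()   # same, but as a single pandas DataFrame on the driver

# Driver-friendly alternatives that keep the heavy work on the executors:
df.write.mode("overwrite").saveAsTable("results_table")  # executors write directly
summary = df.groupBy("status").count()                   # aggregate first
small_result = summary.limit(100).collect()              # only a bounded result reaches the driver
```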
04-26-2023 03:25 AM
You can take a look at your driver memory in the Ganglia UI; monitor it as your cluster runs.
04-26-2023 04:09 AM
Yes, I'm monitoring driver memory in Ganglia (attaching a screenshot of the driver node).
What might be the list of operations done on the driver side that I need to avoid?