02-27-2023 12:29 PM
Hi, I am running concurrent notebooks in concurrent workflow jobs on a job compute cluster (c5a.8xlarge, 5-7 worker nodes). Each job runs 100 concurrent child notebooks, and there are 10 job instances. 8 of the 10 jobs fail with the error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."
How can I resolve that?
02-27-2023 10:22 PM
@uzair mustafa
Check Ganglia for performance-related issues (maybe it's running out of memory, i.e. OOM?).
02-27-2023 10:49 PM
Hi,
I have been checking Ganglia. About 300 GB of free space is available.
02-28-2023 10:41 PM
@uzair mustafa
It's hard to answer without digging into the logs and code.
02-28-2023 11:12 PM
I have identified the issue. The driver memory is being exhausted, and the worker nodes are not coming into play...
02-28-2023 11:34 PM
@uzair mustafa
So basically, that's what I expected, OOM.
It's good that you were able to find an issue.
03-12-2023 09:53 PM
Hi @uzair mustafa
Hope everything is going great.
Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.
Cheers!
03-12-2023 10:09 PM
04-26-2023 01:10 AM
Did you solve this issue? I'm in a similar situation.
04-26-2023 03:24 AM
@Jeetash Kumar I identified the issue: the driver memory was getting exhausted. I fine-tuned my code so that fewer operations run on the driver side, and I reduced the concurrency of my tasks. This answer is based on my use case.
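For anyone landing here later, this is a minimal sketch of what "reducing the concurrency" can look like when a parent notebook launches child notebooks from threads. It is not the exact code from this thread. The helper name `run_notebooks`, the `run_one` callable, and the default of 8 workers are assumptions made for illustration; on Databricks, `run_one` would typically wrap `dbutils.notebook.run`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_notebooks(paths, run_one, max_workers=8):
    """Run run_one(path) for every path, with at most max_workers in flight.

    run_one is a hypothetical callable supplied by the caller. On Databricks
    it could be: lambda p: dbutils.notebook.run(p, 3600)
    Returns (results, failures), both keyed by notebook path.
    """
    results, failures = {}, {}
    # The pool size caps how many child notebooks the driver tracks at once,
    # instead of starting all of them simultaneously.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure, keep the rest running
                failures[path] = exc
    return results, failures
```

A sensible value for `max_workers` depends on the workload. Starting low and raising it while watching driver memory in Ganglia is one way to find it.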
04-26-2023 03:25 AM
You can check your driver memory in the Ganglia UI. Monitor it as your cluster runs.
04-26-2023 04:09 AM
Yes, I'm monitoring driver memory in Ganglia (attaching a screenshot of the driver node).
Which operations run on the driver side that I need to avoid?
04-26-2023 04:11 AM