02-27-2023 12:29 PM
Hi, I am running concurrent notebooks in concurrent workflow jobs on a job compute cluster (c5a.8xlarge) with 5-7 worker nodes. Each job has 100 concurrent child notebooks, and there are 10 job instances. 8 of the 10 jobs fail with the error "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."
How can I resolve that?
- Labels: Concurrent notebooks
02-27-2023 10:22 PM
@uzair mustafa
Check Ganglia for performance-related issues (maybe the cluster is running out of memory?).
02-27-2023 10:49 PM
Hi,
I have been checking Ganglia. About 300 GB of free space is available.
02-28-2023 10:41 PM
@uzair mustafa
It's hard to answer without digging into the logs and code.
02-28-2023 11:12 PM
I have identified the issue. The driver memory is being exhausted, and the worker nodes are not coming into play...
02-28-2023 11:34 PM
@uzair mustafa
So basically, that's what I expected, OOM 🙂
It's good that you were able to find the issue.
03-12-2023 09:53 PM
Hi @uzair mustafa
Hope everything is going great.
Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.
Cheers!
03-12-2023 10:09 PM
I still face the issue; it has not been resolved. It would be great if someone could help me.
04-26-2023 01:10 AM
Did you solve this issue? I'm in a similar situation.
04-26-2023 03:24 AM
@Jeetash Kumar I identified the issue: the driver memory was getting exhausted. I fine-tuned my code so that fewer operations run on the driver side, and I reduced the concurrency of my tasks. This answer is based on my use case.
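As a rough illustration of reducing concurrency, here is a minimal sketch that caps how many child notebooks run at once using a thread pool. It is not the exact code from my jobs: the helper name `run_notebooks_bounded`, the `run_child` callable, and the default limit of 8 are assumptions made for the example, and the right limit depends on your driver size and workload.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple


def run_notebooks_bounded(
    paths: Iterable[str],
    run_child: Callable[[str], str],
    max_concurrency: int = 8,  # assumed value; tune for your driver
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run each child notebook via run_child, at most max_concurrency at a time.

    Returns (results, errors), both keyed by notebook path.
    """
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    # The pool size is the concurrency cap: extra tasks wait in the queue.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = {pool.submit(run_child, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Record the failure and keep going with the other notebooks.
                errors[path] = repr(exc)
    return results, errors


# Inside a Databricks notebook, run_child could wrap dbutils.notebook.run
# (child_paths and the 3600 s timeout are placeholders):
# results, errors = run_notebooks_bounded(
#     child_paths,
#     lambda path: dbutils.notebook.run(path, 3600),
#     max_concurrency=8,
# )
```

Passing the runner in as a callable keeps the helper independent of `dbutils`, so it can be tried outside a notebook with a stand-in function.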
04-26-2023 03:25 AM
You can check your driver memory in the Ganglia UI; monitor it while your cluster runs.
04-26-2023 04:09 AM
Yes, I'm monitoring driver memory in Ganglia (attaching a screenshot of the driver node).
Which operations run on the driver side that I need to avoid?
04-26-2023 04:11 AM