Concurrent Jobs - The spark driver has stopped unexpectedly!

uzairm
New Contributor III

Hi, I am running concurrent notebooks in concurrent workflow jobs on a job compute cluster (c5a.8xlarge) with 5-7 worker nodes. Each job runs 100 concurrent child notebooks, and there are 10 job instances. 8 out of 10 jobs fail with the error: "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

How can I resolve that?

1 ACCEPTED SOLUTION

Accepted Solutions

uzairm
New Contributor III

I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...



daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

Check Ganglia for performance-related issues (maybe it's hitting OOM?).

uzairm
New Contributor III

Hi,

I have been checking Ganglia. About 300 GB of free space is available.

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

It's hard to answer without digging into the logs and code.

uzairm
New Contributor III

I have identified the issue. The driver memory is being exhausted and the worker nodes are not coming into play...

daniel_sahal
Esteemed Contributor

@uzair mustafa​ 

So basically, that's what I expected: OOM 🙂

It's good that you were able to find the issue.

Anonymous
Not applicable

Hi @uzair mustafa​ 

Hope everything is going great.

Just wanted to check in to see if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!

uzairm
New Contributor III
Hi Vidula,
I still face the issue and it has not been resolved. It would be great if someone could help me.

JKR
New Contributor III

Did you solve this issue? I'm in a similar situation.

uzairm
New Contributor III

@Jeetash Kumar​ I identified the issue: the driver memory was getting exhausted. I fine-tuned my code so that fewer operations are performed on the driver side, and I reduced the concurrency of my tasks. This answer is based on my use case.
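Editor's note: one common way to reduce task concurrency, as described above, is to cap how many child notebooks run at once rather than launching all 100 simultaneously. A minimal sketch using Python's standard library; `run_notebook` is a hypothetical stand-in for whatever launches a child notebook (on Databricks that would typically be `dbutils.notebook.run`):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-in for the real notebook launcher
# (e.g. dbutils.notebook.run(path, timeout, args) on Databricks).
def run_notebook(path: str) -> str:
    return f"done:{path}"

def run_all(paths, max_workers=8):
    """Run child notebooks with at most `max_workers` in flight.

    Capping concurrency bounds the number of driver-side threads and
    in-flight result objects, instead of holding 100 at once.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_notebook, p): p for p in paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

notebook_paths = [f"/jobs/child_{i}" for i in range(100)]
```

Tuning `max_workers` trades total wall-clock time against peak driver memory; the right value depends on how heavy each child notebook is.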

uzairm
New Contributor III

You can look at your driver memory in the Ganglia UI; monitor it as your cluster runs.
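Editor's note: Ganglia is the right tool for continuous monitoring, but for a quick in-notebook spot check you can also ask the driver process itself for its peak memory use. A small sketch assuming a Unix driver (standard library only; note the `ru_maxrss` unit is kilobytes on Linux but bytes on macOS):

```python
import resource

def peak_rss_mb() -> float:
    """Peak resident set size of the current (driver) process, in MB.

    On Linux, ru_maxrss is reported in kilobytes.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
```

Printing this before and after a suspect cell helps attribute driver-memory growth to a specific operation.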

JKR
New Contributor III

Yes, I'm monitoring driver memory in Ganglia (attaching a screenshot of the driver node).

What might be the operations done on the driver side that I need to avoid?

uzairm
New Contributor III
Operations like collect() and toPandas() pull the full result set onto the driver node, and aggregation results also land on the driver once you collect them. Avoid those where possible.
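Editor's note: the driver-memory difference between collecting everything and consuming rows incrementally (as `DataFrame.toLocalIterator()` does in Spark) can be illustrated with a pure-Python analogy, no Spark required:

```python
def rows(n):
    """Lazily yield n 'rows' one at a time (analogy for a distributed result)."""
    for i in range(n):
        yield i

N = 100_000

# collect()-style: every row is materialized in driver memory at once.
collected = list(rows(N))

# iterator-style (like toLocalIterator()): only one row is in flight,
# so driver memory stays constant regardless of N.
streamed_total = sum(rows(N))
```

The same principle applies in Spark: prefer writing results out with distributed sinks, or iterate with `toLocalIterator()`, instead of calling `collect()` on large DataFrames.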