Databricks Community

Alix · ‎02-21-2022

Hello,

I've been trying to submit a job to a transient cluster, but it is failing with this error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7) (10.139.64.5 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

The same job works fine on an interactive cluster with the same specs (the job is also pretty tiny, so I don't see how containers could be exceeding thresholds). I don't specify anything special at cluster creation, and I don't install any special Spark libraries. I'm running out of ideas about what could be causing the error. Any clues?

Thanks 🙂

shan_chandra · ‎05-10-2022

@Alix Métivier - The error is thrown from the user code (please investigate the jar file attached to the cluster).

at m80.dbruniv_0_1.dbruniv.tFixedFlowInput_1Process(dbruniv.java:941)

at m80.dbruniv_0_1.dbruniv.run(dbruniv.java:1654)

at m80.dbruniv_0_1.dbruniv.runJobInTOS(dbruniv.java


Anonymous · ‎02-21-2022

Hello, @Alix Métivier - My name is Piper and I'm a moderator for Databricks. Welcome to the community and thank you for your question. We'll give it a while to see what your fellow members have to say. We'll circle back around if we need to.

Thanks in advance for your patience.

AmanSehgal · ‎02-21-2022

@Alix Métivier could you check the output of the entire notebook or of each cell?

There's a size limit on the output of each cell and of the entire notebook.

Alix · ‎02-22-2022

Hi @Aman Sehgal, I just did a run with the option spark.databricks.driver.disableScalaOutput set to true, and the error is still there.

I'm not using a notebook; I'm using Runs submit with a Java jar.

AmanSehgal · ‎02-22-2022

What else can you grab from the Spark logs?

Alix · ‎02-22-2022

That's all the logs I get, and there's not much help in them.

jose_gonzalez · ‎06-07-2022

Hi @Alix Métivier ,

Are you able to get the logs for executor 4? It seems like these logs are from the driver, not the executor.

Alix · ‎02-23-2022

The issue was caused by the fact that I set "spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and also set "spark.kryo.registrator" in the Spark conf of the transient cluster.

After removing them it's working. Does that mean that Databricks does not support Kryo with transient clusters?
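
For anyone landing here with the same error, the workaround above can be sketched as a small helper that drops the two Kryo keys from a Spark conf map before it is passed to the transient cluster definition. This is only a sketch: the class name KryoConfFilter and the method withoutKryoSettings are hypothetical, and it assumes the conf is held as a plain Map<String, String>. It mirrors the workaround; it does not explain why the Kryo settings caused the failure.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Spark or Databricks.
final class KryoConfFilter {
    // The two keys reported as removed in the post above.
    private static final Set<String> KRYO_KEYS =
            Set.of("spark.serializer", "spark.kryo.registrator");

    // Returns a copy of the conf without the Kryo keys; the input map is not modified.
    static Map<String, String> withoutKryoSettings(Map<String, String> conf) {
        Map<String, String> filtered = new LinkedHashMap<>(conf);
        filtered.keySet().removeAll(KRYO_KEYS);
        return filtered;
    }
}
```

Calling KryoConfFilter.withoutKryoSettings(conf) on a map that also contains other settings, such as spark.databricks.driver.disableScalaOutput, leaves those other settings untouched.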