02-21-2022 09:00 AM
Hello,
I've been trying to submit a job to a transient cluster, but it is failing with this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7) (10.139.64.5 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
The same job works fine on an interactive cluster with the same specs (the job is also pretty tiny, so I don't understand the part about containers exceeding thresholds). I don't specify anything special at cluster creation and I don't install any special Spark-related libraries. I'm running out of ideas on what could be causing the error, any clues?
Thanks 🙂
02-21-2022 11:11 AM
Hello, @Alix Métivier - My name is Piper and I'm a moderator for Databricks. Welcome to the community and thank you for your question. We'll give it a while to see what your fellow members have to say. We'll circle back around if we need to.
Thanks in advance for your patience.
02-21-2022 06:16 PM
@Alix Métivier could you check the output of the entire notebook or of each cell?
There's a size limit on the output of each cell and on the notebook as a whole.
02-22-2022 12:28 AM
Hi @Aman Sehgal, I just did a run with the option spark.databricks.driver.disableScalaOutput set to true and the error is still there.
I'm not using a notebook; I'm submitting a Java JAR through Runs submit.
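For context, this is roughly the shape of the Runs submit call, as a minimal sketch only: it assumes the Jobs 2.0 runs/submit endpoint, the workspace URL, JAR path, node type and runtime version are placeholders, and the main class is just the one visible in the stack trace, so adjust everything to your own setup.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitTransientJarRun {
    public static void main(String[] args) throws Exception {
        // Placeholder workspace URL; the token is read from the environment.
        String host = "https://<workspace>.cloud.databricks.com";
        String token = System.getenv("DATABRICKS_TOKEN");

        // Runs submit payload: the transient ("new") cluster spec carries spark_conf,
        // so options like spark.databricks.driver.disableScalaOutput are set here.
        String payload = """
            {
              "run_name": "transient-jar-run",
              "new_cluster": {
                "spark_version": "9.1.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 1,
                "spark_conf": {
                  "spark.databricks.driver.disableScalaOutput": "true"
                }
              },
              "libraries": [ { "jar": "dbfs:/FileStore/jars/dbruniv.jar" } ],
              "spark_jar_task": { "main_class_name": "m80.dbruniv_0_1.dbruniv" }
            }""";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(host + "/api/2.0/jobs/runs/submit"))
            .header("Authorization", "Bearer " + token)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // contains the run_id on success
    }
}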
02-22-2022 04:06 AM
What else can you grab from the Spark logs?
02-22-2022 05:18 AM
That's all the logs I get, and there's not much help inside.
06-07-2022 09:16 AM
Hi @Alix Métivier ,
Are you able to get the logs for executor 4? It seems like these logs are from the driver, not the executor.
02-23-2022 12:42 AM
The issue was caused by the fact that I set spark.serializer to org.apache.spark.serializer.KryoSerializer and also set spark.kryo.registrator in the Spark conf of the transient cluster.
After removing them it's working. Does that mean that Databricks does not support Kryo on transient clusters?
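For reference, the conf I removed looked roughly like this, with spark.kryo.registrator pointing at a registrator class shipped inside the job JAR (the class name and registered types below are illustrative placeholders, not my actual code):

package com.example;

import org.apache.spark.serializer.KryoRegistrator;
import com.esotericsoftware.kryo.Kryo;

// Illustrative registrator: registers every custom type that Kryo will serialize.
public class MyKryoRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(java.util.ArrayList.class);
        kryo.register(java.util.HashMap.class);
    }
}

// The transient cluster's Spark conf then referenced it:
//   spark.serializer        org.apache.spark.serializer.KryoSerializer
//   spark.kryo.registrator  com.example.MyKryoRegistrator

My guess (unconfirmed) is that the registrator class has to be resolvable on the executors when they start, and on a transient cluster the job JAR may not be on the classpath that early, which would explain the executors being lost rather than a clean ClassNotFoundException.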
05-10-2022 07:02 PM
@Alix Métivier - The error is thrown from the user code (please investigate the jar file attached to the cluster).
at m80.dbruniv_0_1.dbruniv.tFixedFlowInput_1Process(dbruniv.java:941)
at m80.dbruniv_0_1.dbruniv.run(dbruniv.java:1654)
at m80.dbruniv_0_1.dbruniv.runJobInTOS(dbruniv.java