02-21-2022 09:00 AM
Hello,
I've been trying to submit a job to a transient cluster, but it is failing with this error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7) (10.139.64.5 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
The same job works fine on an interactive cluster with the same specs. The job is also tiny, so I don't see how containers could be exceeding thresholds. I don't specify anything special at cluster creation, and I don't install any special Spark libraries. I'm running out of ideas about what could be causing the error. Any clues?
Thanks
05-10-2022 07:02 PM
@Alix Métivier - The error is thrown from the user code (please investigate the jar file attached to the cluster).
at m80.dbruniv_0_1.dbruniv.tFixedFlowInput_1Process(dbruniv.java:941)
at m80.dbruniv_0_1.dbruniv.run(dbruniv.java:1654)
at m80.dbruniv_0_1.dbruniv.runJobInTOS(dbruniv.java
02-21-2022 11:11 AM
Hello, @Alix Métivier - My name is Piper and I'm a moderator for Databricks. Welcome to the community and thank you for your question. We'll give it a while to see what your fellow members have to say. We'll circle back around if we need to.
Thanks in advance for your patience.
02-21-2022 06:16 PM
@Alix Métivier could you check the output of the entire notebook or of each cell?
There's a size limit on cell output and on the output of the entire notebook.
02-22-2022 12:28 AM
Hi @Aman Sehgal, I just ran the job with the option spark.databricks.driver.disableScalaOutput set to true, and the error is still there.
I'm not using a notebook; I'm using Runs submit with a Java jar.
02-22-2022 04:06 AM
What else can you grab from spark logs?
02-22-2022 05:18 AM
That's all the logs I get, and there's not much help in them.
06-07-2022 09:16 AM
Hi @Alix Métivier,
Are you able to get the logs for executor 4? It seems like these logs are from the driver, not the executor.
02-23-2022 12:42 AM
The issue was caused by setting "spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and setting "spark.kryo.registrator" in the Spark conf of the transient cluster.
After removing them it's working. Does that mean that Databricks does not support Kryo with transient clusters?
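Since the fix came down to a difference in Spark conf between the two clusters, a small comparison of the two configurations can surface that kind of mismatch early. The following is only a rough sketch: `SparkConfDiff` and its `diff` method are hypothetical helpers (not part of Spark or Databricks), the registrator class name is a placeholder because the thread does not give the real one, and the empty interactive conf is an assumption. It does not establish whether Kryo is supported on transient clusters; it only lists the keys that differ.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not a Spark or Databricks API): lists the Spark conf
// keys whose values differ between two cluster configurations.
final class SparkConfDiff {

    // Returns key -> "left value | right value" for every key that is missing
    // on one side or set to a different value. "<unset>" marks a missing key.
    static Map<String, String> diff(Map<String, String> left, Map<String, String> right) {
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());

        Map<String, String> differences = new TreeMap<>();
        for (String key : keys) {
            String leftValue = left.getOrDefault(key, "<unset>");
            String rightValue = right.getOrDefault(key, "<unset>");
            if (!leftValue.equals(rightValue)) {
                differences.put(key, leftValue + " | " + rightValue);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Assumption: the interactive cluster was created without extra Spark conf.
        Map<String, String> interactiveConf = new TreeMap<>();

        // The two settings named in the post above; the registrator value is a
        // placeholder, since the real class name is not given in the thread.
        Map<String, String> transientConf = new TreeMap<>();
        transientConf.put("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        transientConf.put("spark.kryo.registrator", "com.example.MyRegistrator");

        // Print one line per differing key.
        diff(interactiveConf, transientConf)
                .forEach((key, values) -> System.out.println(key + ": " + values));
    }
}
```

The conf maps would have to be filled in by hand (or from wherever the cluster definitions are kept); any key reported here is a candidate to remove or align when a job behaves differently on the two clusters.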