Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Remote RPC client disassociated error

Alix
New Contributor III

Hello,

I've been trying to submit a job to a transient cluster, but it is failing with this error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7) (10.139.64.5 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

The same job works fine on an interactive cluster with the same specs. The job is also pretty tiny, so I don't see how containers could be exceeding thresholds. I don't specify anything special at cluster creation, and I don't install any special Spark-related libraries. I'm running out of ideas about what could be causing the error. Any clues?

Thanks 🙂

1 ACCEPTED SOLUTION

Accepted Solutions

shan_chandra
Databricks Employee

@Alix Métivier - The error is thrown from the user code (please investigate the JAR file attached to the cluster).

at m80.dbruniv_0_1.dbruniv.tFixedFlowInput_1Process(dbruniv.java:941)

at m80.dbruniv_0_1.dbruniv.run(dbruniv.java:1654)

at m80.dbruniv_0_1.dbruniv.runJobInTOS(dbruniv.java


8 REPLIES

Anonymous
Not applicable

Hello, @Alix Métivier - My name is Piper and I'm a moderator for Databricks. Welcome to the community and thank you for your question. We'll give it a while to see what your fellow members have to say. We'll circle back around if we need to.

Thanks in advance for your patience.

AmanSehgal
Honored Contributor III

@Alix Métivier, could you check the output of the entire notebook or of each cell?

There's a size limit on the output of each cell and of the entire notebook.

Alix
New Contributor III

Hi @Aman Sehgal, I just did a run with the option spark.databricks.driver.disableScalaOutput set to true, and the error is still there.

I'm not using a notebook; I'm using Runs submit with a Java JAR.

AmanSehgal
Honored Contributor III

What else can you grab from the Spark logs?

Alix
New Contributor III

That's all the logs I get, and there's not much help in them.

jose_gonzalez
Databricks Employee

Hi @Alix Métivier,

Are you able to get the logs for executor 4? It seems like these logs are from the driver, not the executor.

Alix
New Contributor III

The issue was caused by the fact that I set "spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and also set "spark.kryo.registrator" in the Spark conf of the transient cluster.

After removing them, it's working. Does that mean that Databricks does not support Kryo with transient clusters?
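For anyone trying to reproduce this, here is a minimal sketch of the isolation step described above: take the Spark conf that fails on the transient cluster and derive a copy with only the two Kryo keys removed, so the two runs differ in nothing else. It uses plain Java collections only and does not call any Spark or Databricks API. The class name, the method names, and the registrator value `com.example.MyRegistrator` are hypothetical placeholders, not taken from the original job.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TransientClusterConf {
    // The two conf keys named in the post above.
    public static final String SERIALIZER = "spark.serializer";
    public static final String REGISTRATOR = "spark.kryo.registrator";

    // Conf resembling the failing run. The registrator class is a
    // hypothetical placeholder; the thread does not say which class was used.
    public static Map<String, String> failingConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(SERIALIZER, "org.apache.spark.serializer.KryoSerializer");
        conf.put(REGISTRATOR, "com.example.MyRegistrator"); // assumption
        conf.put("spark.databricks.driver.disableScalaOutput", "true");
        return conf;
    }

    // Returns a copy with only the Kryo-related keys removed,
    // leaving every other setting untouched.
    public static Map<String, String> withoutKryo(Map<String, String> conf) {
        Map<String, String> copy = new LinkedHashMap<>(conf);
        copy.remove(SERIALIZER);
        copy.remove(REGISTRATOR);
        return copy;
    }

    public static void main(String[] args) {
        // Print the conf for the comparison run, one key=value per line.
        withoutKryo(failingConf())
            .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting map would then be supplied as the Spark conf of the transient cluster for the comparison run. Removing one key at a time instead of both would show whether the serializer or the registrator setting is the trigger; the thread does not establish which.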

