Data Engineering

Remote RPC client disassociated error

Alix
New Contributor III

Hello,

I've been trying to submit a job to a transient cluster, but it is failing with this error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 7) (10.139.64.5 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

The same job works fine on an interactive cluster with the same specs (also, the job is pretty tiny, so I don't see how containers could be exceeding thresholds). I don't specify anything special at cluster creation, and I don't install any special Spark-related libraries. I'm running out of ideas about what could be causing the error. Any clues?

Thanks 🙂

Accepted Solution

shan_chandra
Esteemed Contributor

@Alix Métivier - The error is thrown from the user code (please investigate the JAR file attached to the cluster):

at m80.dbruniv_0_1.dbruniv.tFixedFlowInput_1Process(dbruniv.java:941)

at m80.dbruniv_0_1.dbruniv.run(dbruniv.java:1654)

at m80.dbruniv_0_1.dbruniv.runJobInTOS(dbruniv.java


9 REPLIES

Anonymous
Not applicable

Hello, @Alix Métivier​ - My name is Piper and I'm a moderator for Databricks. Welcome to the community and thank you for your question. We'll give it a while to see what your fellow members have to say. We'll circle back around if we need to.

Thanks in advance for your patience.

AmanSehgal
Honored Contributor III

@Alix Métivier could you check the output of the entire notebook, or of each cell?

There's a size limit on the output of each cell and on the output of the entire notebook.

Alix
New Contributor III

Hi @Aman Sehgal, I just did a run with the option spark.databricks.driver.disableScalaOutput set to true, and the error is still there.

I'm not using a notebook; I'm using Runs submit with a Java JAR.

AmanSehgal
Honored Contributor III

What else can you grab from spark logs?

Alix
New Contributor III

That's all the logs I get, and there's not much help in them.

Hi @Alix Métivier,

Are you able to get the logs for executor 4? It seems like these logs are from the driver, not the executor.
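
To know which executor's logs to look for, the host and executor ID can be read off the failure message itself. The following is a minimal sketch, assuming the message keeps the "(host executor N): ExecutorLostFailure" shape seen in the error above; `LostExecutorParser` is a hypothetical helper written for illustration, not part of Spark or Databricks.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Spark or Databricks API): extracts the host and
// executor ID from an ExecutorLostFailure message.
class LostExecutorParser {
    // Matches "(<host> executor <id>): ExecutorLostFailure".
    private static final Pattern LOST =
        Pattern.compile("\\((\\S+) executor (\\d+)\\): ExecutorLostFailure");

    /** Returns {host, executorId} if the message names a lost executor. */
    static Optional<String[]> parse(String message) {
        Matcher m = LOST.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        // Excerpt of the message quoted in the original question.
        String msg = "Lost task 0.3 in stage 0.0 (TID 7) (10.139.64.5 executor 4): "
            + "ExecutorLostFailure (executor 4 exited caused by one of the running tasks)";
        parse(msg).ifPresent(p ->
            System.out.println("Look for logs of executor " + p[1] + " on host " + p[0]));
    }
}
```

With the host and ID in hand, the matching executor's stdout, stderr and log4j output are the place to look for the actual crash, rather than the driver log.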

Alix
New Contributor III

The issue was caused by the fact that I set "spark.serializer" to "org.apache.spark.serializer.KryoSerializer", and also set "spark.kryo.registrator", in the Spark conf of the transient cluster.

After removing them, it's working. Does that mean that Databricks does not support Kryo on transient clusters?
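
Since the same job ran on the interactive cluster, comparing the two clusters' Spark conf side by side is one way to surface a difference like this. Below is a minimal sketch under stated assumptions: the conf of each cluster has already been collected into a plain map, `SparkConfDiff` is a hypothetical helper, and `com.example.MyRegistrator` is a placeholder class name (the real registrator is not given in this thread).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper (not a Spark or Databricks API): lists the Spark conf
// keys whose values differ between a working and a failing cluster.
class SparkConfDiff {
    private static final String UNSET = "<unset>";

    /** Returns differing keys, each mapped to "workingValue -> failingValue". */
    static Map<String, String> diff(Map<String, String> working,
                                    Map<String, String> failing) {
        // Union of keys, sorted so the report is stable.
        Set<String> keys = new TreeSet<>(working.keySet());
        keys.addAll(failing.keySet());

        Map<String, String> out = new TreeMap<>();
        for (String key : keys) {
            String a = working.getOrDefault(key, UNSET);
            String b = failing.getOrDefault(key, UNSET);
            if (!a.equals(b)) {
                out.put(key, a + " -> " + b);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> interactive = new HashMap<>();
        interactive.put("spark.databricks.driver.disableScalaOutput", "true");

        Map<String, String> transientCluster = new HashMap<>(interactive);
        transientCluster.put("spark.serializer",
            "org.apache.spark.serializer.KryoSerializer");
        // Placeholder class name; the real registrator is not shown in the thread.
        transientCluster.put("spark.kryo.registrator", "com.example.MyRegistrator");

        diff(interactive, transientCluster)
            .forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Any key reported here is a candidate to remove or align first when a job fails on only one of two otherwise identical clusters.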

Kaniz
Community Manager

Hi @Alix Métivier, just a friendly follow-up. Do you still need help, or did @Shanmugavel Chandrakasu's response help you find the solution? Please let us know.
