cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

ExecutorLostFailure: Remote RPC Client Disassociated

McKayHarris
New Contributor II

This is an expensive and long-running job that gets about halfway done before failing. The stack trace is included below, but here is the salient part:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4881 in stage 1.0 failed 4 times, most recent failure: Lost task 4881.3 in stage 1.0 (TID 7305, 10.37.129.129): ExecutorLostFailure (executor 116 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

This job has been running fine for months up to this point. I have tried increasing the node size, but still receiving this error. I saw from a Google search that this might be YARN not having enough provisioned memory, but I was under the impression that it was all configured under the hood with Databricks. Should I try tweaking these values? If so which?

Here is the full stack-trace:

---------------------------------------------------------------------------Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-b42178413d3f> in <module>()    254     )
    255--> 256final_data.write         .format('com.databricks.spark.redshift').option('preactions', delete_stmt).option('url', REDSHIFT_URL).option('dbtable', load_table).option('tempdir', REDSHIFT_TEMPDIR +'/courses_monthly').mode("append").save()/databricks/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)    528             self.format(format)    529if path is None:--> 530self._jwrite.save()    531else:    532             self._jwrite.save(path)/databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)    931         answer = self.gateway_client.send_command(command)    932         return_value = get_return_value(
--> 933             answer, self.gateway_client, self.target_id, self.name)
    934    935for temp_arg in temp_args:/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)     61def deco(*a,**kw):     62try:---> 63return f(*a,**kw)     64except py4j.protocol.Py4JJavaError as e:     65             s = e.java_exception.toString()/databricks/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)    310                 raise Py4JJavaError(
    311"An error occurred while calling {0}{1}{2}.\n".--> 312                     format(target_id, ".", name), value)
    313else:    314                 raise Py4JError(

 

Py4JJavaError: An error occurred while calling o305.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:149)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:511)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
    at com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:278)
    at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:346)
    at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:443)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4881 in stage 1.0 failed 4 times, most recent failure: Lost task 4881.3 in stage 1.0 (TID 7305, 10.37.129.129): ExecutorLostFailure (executor 116 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1452)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1440)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1439)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1439)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1665)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1868)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1881)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1901)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:143)
    ... 34 more 

17 REPLIES 17

RodrigoDe_Freit
New Contributor II

According to https://docs.databricks.com/jobs.html#jar-job-tips

Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and marked as failed.

That was my problem, to "fix it" I've just set the logging level to ERROR

val sc = SparkContext.getOrCreate(conf)

sc.setLogLevel("ERROR")

It was solved

RodrigoDe_Freit
New Contributor II

According to https://docs.databricks.com/jobs.html#jar-job-tips:

"Job output, such as log output emitted to stdout, is subject to a 20MB size limit. If the total output has a larger size, the run will be canceled and marked as failed."

That was my problem, to "fix it" I've just set the logging level to ERROR

val sc = SparkContext.getOrCreate(conf)

sc.setLogLevel("ERROR")

This workaround works for me

I am facing the same error but the log output to stdout is not an issue as the log file size turns out to be < 2 MB. So that issue is ruled out. Moreover, our job is dummy for testing purposes and is not doing any memory intensive operation. Its purely running a simple thread that keeps on logging to the stdout every 5 mins.

Still the cluster is getting timed out.

Below is the post i have submitted on stack overflow.

https://stackoverflow.com/questions/59820940/databricks-job-timed-out-with-error-lost-executor-0-on-...

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.