
Jobs with multiple tasks are failing to retry; how do we fix this issue?

Ludo
New Contributor III

Hello,

This is a question about our platform on `Databricks Runtime 11.3 LTS`.

I'm running a job with multiple tasks in parallel using a shared cluster.

Each task runs a dedicated Scala class from a JAR library attached as a dependency.

One of the tasks fails (a code-related error) and a retry is performed as expected. Unfortunately, the retry then fails repeatedly with the following error message:

command--1:1: error: not found: value fakepackage
fakepackage.fakenamespace.fakeclass.main(Array("environmentCode=VALUE","silverPath=s3a://silverpath","goldPath=s3a://goldpath/"))

It does not fail when the job first starts; the issue only happens on a retry.

Previously, I ran these tasks independently (one task per job) on task-level clusters, and retries worked fine.

Is there an issue in Databricks with attaching the JAR library to the shared cluster? (Was it not an issue before because a retry got a brand-new cluster?)

Thank you for your help.

Attached is the JSON to create the job with the Jobs API 2.1.
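
For reference, here is a trimmed sketch of that JSON (the main class, parameters, and JAR path are the anonymized values from this post; the task key, node type, and cluster sizing are placeholders):

```json
{
  "name": "multi-task-shared-cluster-job",
  "job_clusters": [
    {
      "job_cluster_key": "shared_cluster",
      "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "tasks": [
    {
      "task_key": "silver_to_gold",
      "job_cluster_key": "shared_cluster",
      "max_retries": 1,
      "min_retry_interval_millis": 60000,
      "libraries": [ { "jar": "dbfs:/shared/jarFileName.jar" } ],
      "spark_jar_task": {
        "main_class_name": "fakepackage.fakenamespace.fakeclass",
        "parameters": [
          "environmentCode=VALUE",
          "silverPath=s3a://silverpath",
          "goldPath=s3a://goldpath/"
        ]
      }
    }
  ]
}
```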


7 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

Try the 10.3/10.4 LTS runtime version and let us know if it is still failing.

AviralBhardwaj

Ludo
New Contributor III

Hello,

Unfortunately, we can't downgrade to 10.3/10.4 LTS (Spark 3.2.1), as we are using some features from the newer Spark 3.3.0.

We upgraded to 12.0 and we are currently monitoring; we will let you know if it gets better.

Side note: we observed that the first job crash is due to an OOM on the driver. It looks like the retry restarts the driver but does not manage to re-attach the library.

Ludo
New Contributor III

@Aviral Bhardwaj Quick update: the issue still happens on 12.0. 😞

Attached are the Driver OOM crash logs from the Standard Error output.

Aviral-Bhardwaj
Esteemed Contributor III

I've never seen this type of error.

Can you check with help@databricks.com? They will help you.

AviralBhardwaj

Hubert-Dudek
Esteemed Contributor III
(Accepted solution)

The retry failed because the JAR library is unresponsive. You need to kill it on the cluster, or restart the cluster after this happens.
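
If you need to automate that, one option (my suggestion, not something the retry does for you) is to call the Clusters API restart endpoint, `POST /api/2.0/clusters/restart`, passing the affected cluster ID as the request body; the ID below is the one from the SPARK_CONTEXT_MISMATCH_FAILURE error quoted later in this thread:

```json
{
  "cluster_id": "0113-010003-id"
}
```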

Ludo
New Contributor III

Thank you folks; this is useful.

We've got an updated error message this morning; still on a task retry with a shared cluster. (error below)

You're saying a JAR library can be unresponsive (it isn't a process, though?) and that we would have to kill it manually? I'd expect the automated retry to handle that... Looks like a Databricks bug, @Hubert Dudek.

Your explanation matches what we observed: it works properly without a shared cluster (one dedicated cluster per task), since a task-level cluster is fully restarted on retry.

Nevertheless, we would like to keep using shared clusters for multiple tasks; one cluster per task does not make sense in our use case.

Run result unavailable: job failed with error message:
Library installation failed for library due to user error for jar: "dbfs:/shared/jarFileName.jar". Error messages:
Library installation failed after PENDING for 10 minutes since cluster entered RUNNING state. Error Code: SPARK_CONTEXT_MISMATCH_FAILURE. Cannot get library installation state for cluster [0113-010003-id]. This can occur if the driver was recently restarted or terminated.
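
For comparison, this is roughly the per-task cluster shape that retried cleanly for us before we consolidated onto a shared cluster (same placeholder names as in my first post). Since each retry provisions a fresh cluster and driver, the library is re-installed from scratch:

```json
{
  "task_key": "silver_to_gold",
  "max_retries": 1,
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2
  },
  "libraries": [ { "jar": "dbfs:/shared/jarFileName.jar" } ],
  "spark_jar_task": {
    "main_class_name": "fakepackage.fakenamespace.fakeclass",
    "parameters": [ "environmentCode=VALUE" ]
  }
}
```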

YoshiCoppens61
New Contributor II

Hi,

This actually should not be marked as solved. We are having the same problem: whenever a shared job cluster crashes for some reason (generally an OOM), all tasks keep failing indefinitely with the error message described above. This is a pretty serious bug in multi-task Databricks Workflows, and it makes them basically unusable. If you have to perform a series of manual intervention steps whenever something goes wrong, what is the point?
