01-06-2023 06:54 AM
Hello,
This is a question about our platform running `Databricks Runtime 11.3 LTS`.
I'm running a job with multiple tasks in parallel using a shared cluster.
Each task runs a dedicated Scala class from a JAR library attached as a dependency.
One of the tasks fails (a code-related error) and a retry is performed as expected. Unfortunately, this retry fails continuously with the following error message:
command--1:1: error: not found: value fakepackage
fakepackage.fakenamespace.fakeclass.main(Array("environmentCode=VALUE","silverPath=s3a://silverpath","goldPath=s3a://goldpath/"))
It does not fail when the job starts; the issue only happens when there is a retry.
Previously, I was running these tasks independently (one task per job) with a task-level cluster, and the retry worked fine.
Is there an issue in Databricks with attaching the JAR library to the shared cluster? (It was not an issue before, presumably because a retry got a brand-new cluster?)
Thank you for your help.
Attached is the JSON to create the job with the Jobs API 2.1.
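For context, the entry point each task invokes looks roughly like the minimal sketch below. The package, object, and argument names are just the redacted placeholders from the error message above, and the body is hypothetical; the `not found: value fakepackage` error suggests the generated command is being compiled on a driver that no longer has the JAR on its classpath, so even the root package symbol cannot be resolved.

```scala
package fakepackage.fakenamespace

// Hypothetical sketch of the entry point the JAR task invokes on the shared cluster.
// Names mirror the redacted placeholders above; the real ETL logic lives in the attached JAR.
object fakeclass {
  def main(args: Array[String]): Unit = {
    // The task passes "key=value" parameters, e.g. silverPath=s3a://silverpath
    val params = args.flatMap { arg =>
      arg.split("=", 2) match {
        case Array(k, v) => Some(k -> v)
        case _           => None
      }
    }.toMap

    println(s"Starting with parameters: $params")
    // ... read from params("silverPath"), write to params("goldPath"), etc. ...
  }
}
```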
Labels: Error Message, JAR Library, Job Cluster, JOBS, Multi
Accepted Solutions
01-13-2023 01:21 AM
The retry failed because the JAR library is unresponsive. You need to kill it on the cluster or restart the cluster after this happens.
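For reference, the manual workaround described above (restarting the shared cluster) can be scripted against the Clusters API. This is only a minimal sketch under assumptions: it presumes the workspace URL and a personal access token are available in `DATABRICKS_HOST` and `DATABRICKS_TOKEN`, that the cluster is one you are permitted to restart via `POST /api/2.0/clusters/restart` (clusters created and managed by a job run may not allow this), and that the cluster id is a placeholder.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Minimal sketch: restart a cluster via the Databricks Clusters API 2.0.
// Assumes DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com) and
// DATABRICKS_TOKEN (a personal access token) are set; the cluster id is passed as an argument.
object RestartSharedCluster {
  def main(args: Array[String]): Unit = {
    val host      = sys.env("DATABRICKS_HOST")
    val token     = sys.env("DATABRICKS_TOKEN")
    val clusterId = args.headOption.getOrElse(sys.error("usage: RestartSharedCluster <cluster_id>"))

    val request = HttpRequest.newBuilder()
      .uri(URI.create(s"$host/api/2.0/clusters/restart"))
      .header("Authorization", s"Bearer $token")
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(s"""{"cluster_id": "$clusterId"}"""))
      .build()

    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(s"HTTP ${response.statusCode()}: ${response.body()}")
  }
}
```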
01-07-2023 08:12 AM
Try the 10.3/10.4 LTS runtime and let us know if it is still failing.
01-12-2023 12:03 AM
Hello,
Unfortunately, we can't downgrade to 10.3/10.4 LTS (Spark 3.2.1), as we are using some features from the latest Spark 3.3.0.
We upgraded to 12.0 and are currently monitoring; we will let you know if it gets better.
Side note: we observed that the first job crash is due to an OOM on the driver. It looks like the retry restarts the driver but does not manage to attach the library.
01-12-2023 08:17 AM
@Aviral Bhardwaj Quick update: the issue still happens on 12.0.
Attached are the driver OOM crash logs from the standard error output.
01-12-2023 05:14 PM
I have never seen this type of error.
Can you check with help@databricks.com? They will help you.
01-13-2023 01:21 AM
The retry failed because the JAR library is unresponsive. You need to kill it on the cluster or restart the cluster after this happens.
01-13-2023 01:32 AM
Thank you folks; this is useful.
We got an updated error message this morning, still on a task retry with a shared cluster (error below).
Do you mean that a JAR library can be unresponsive (it is not a process, though?) and that we would have to kill it manually? I would expect the automated retry to handle it... This looks like a Databricks bug @Hubert Dudek.
Your explanation confirms what we observed: it works properly without a shared cluster (one dedicated cluster per task). With a task-level cluster, the cluster is fully restarted.
Nevertheless, we would like to keep using shared clusters for multiple tasks (having one cluster per task does not make sense in our use case).
Run result unavailable: job failed with error message:
Library installation failed for library due to user error for jar: "dbfs:/shared/jarFileName.jar". Error messages:
Library installation failed after PENDING for 10 minutes since cluster entered RUNNING state. Error Code: SPARK_CONTEXT_MISMATCH_FAILURE. Cannot get library installation state for cluster [0113-010003-id]. This can occur if the driver was recently restarted or terminated.
09-11-2023 05:54 AM
Hi,
This should not actually be marked as solved. We are having the same problem: whenever a shared job cluster crashes for some reason (generally OOM), all tasks keep failing indefinitely with the error message described above. This is a pretty serious bug in multi-task Databricks Workflows that makes them basically unusable. If you have to perform a whole series of manual intervention steps whenever something goes wrong, what is the point?