01-06-2023 06:54 AM
Hello,
This is a question about our platform running `Databricks Runtime 11.3 LTS`.
I'm running a Job with multiple tasks in parallel using a shared cluster.
Each task runs a dedicated Scala class from a JAR library attached as a dependency.
One of the tasks fails (a code-related error) and a retry is performed as expected. Unfortunately, the retry keeps failing with the following error message:
```
command--1:1: error: not found: value fakepackage
fakepackage.fakenamespace.fakeclass.main(Array("environmentCode=VALUE","silverPath=s3a://silverpath","goldPath=s3a://goldpath/"))
```
It does not fail when the job starts; the issue only happens on a retry.
Previously, I was running these tasks independently (one task per job) with task-level clusters, and retries worked fine.
Is there an issue in Databricks with attaching the JAR library to a shared cluster? (Was it not an issue before because each retry got a brand-new cluster?)
Thank you for your help.
Attached is the JSON to create the job with the Jobs API 2.1.
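For readers without the attachment, here is a minimal sketch of what such a Jobs API 2.1 payload might look like; the job name, cluster spec, and task key below are placeholders, not the actual attachment:

```json
{
  "name": "multi-task-shared-cluster-job",
  "job_clusters": [
    {
      "job_cluster_key": "shared_cluster",
      "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "tasks": [
    {
      "task_key": "silver_to_gold",
      "job_cluster_key": "shared_cluster",
      "spark_jar_task": {
        "main_class_name": "fakepackage.fakenamespace.fakeclass",
        "parameters": ["environmentCode=VALUE", "silverPath=s3a://silverpath", "goldPath=s3a://goldpath/"]
      },
      "libraries": [{ "jar": "dbfs:/shared/jarFileName.jar" }],
      "max_retries": 1,
      "min_retry_interval_millis": 60000
    }
  ]
}
```

All tasks reference the same `job_cluster_key`, which is the shared-cluster setup where the retry failure occurs.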
01-07-2023 08:12 AM
Try the 10.3/10.4 LTS runtime version and let us know if it is still failing.
01-12-2023 12:03 AM
Hello,
Unfortunately, we can't downgrade to 10.3/10.4 LTS (Spark 3.2.1), as we are using some features from the newer Spark 3.3.0.
We upgraded to 12.0 and are currently monitoring; we will let you know if it gets better.
Side note: we observed that the first job crash is due to an OOM on the driver. It looks like the retry restarts the driver but does not manage to re-attach the library.
01-12-2023 08:17 AM
@Aviral Bhardwaj Quick update: Issue still happens on 12.0 😞
Attached are the Driver OOM crash logs from the Standard Error output.
01-12-2023 05:14 PM
I have never seen this type of error.
Can you check with help@databricks.com? They will help you.
01-13-2023 01:21 AM
The retry failed because the JAR library is unresponsive. You need to kill it on the cluster or restart the cluster after it happens.
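If restarting the cluster is the workaround, it can at least be scripted against the Clusters API instead of done through the UI. As a sketch, a `POST` to `/api/2.0/clusters/restart` with a body like the following restarts the cluster (the cluster ID is the placeholder from the error message later in this thread):

```json
{
  "cluster_id": "0113-010003-id"
}
```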
01-13-2023 01:32 AM
Thank you folks; this is useful.
We got an updated error message this morning, still on a task retry with a shared cluster (error below).
You mean that a JAR library can become unresponsive (it is not a process, though?) and we would have to kill it manually? I would expect the automated retry to handle that... This looks like a Databricks bug @Hubert Dudek.
Your explanation confirms what we observed, as everything works properly without a shared cluster (one dedicated cluster per task). With a task-level cluster, the cluster is fully restarted on retry.
Nevertheless, we would like to keep using shared clusters for multiple tasks; having one cluster per task does not make sense in our use case. (For comparison, see the task-level sketch after the error below.)
```
Run result unavailable: job failed with error message
Library installation failed for library due to user error for jar: "dbfs:/shared/jarFileName.jar"
. Error messages:
Library installation failed after PENDING for 10 minutes since cluster entered RUNNING state. Error Code: SPARK_CONTEXT_MISMATCH_FAILURE. Cannot get library installation state for cluster [0113-010003-id]. This can occur if the driver was recently restarted or terminated
```
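For comparison with the shared-cluster payload earlier in the thread: with a task-level cluster, the cluster spec is declared inline on the task via `new_cluster` instead of referencing a `job_cluster_key`, so a retry brings up a fresh cluster and re-installs the library. A hypothetical sketch of such a task entry (same placeholder names as above):

```json
{
  "task_key": "silver_to_gold",
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2
  },
  "spark_jar_task": {
    "main_class_name": "fakepackage.fakenamespace.fakeclass",
    "parameters": ["environmentCode=VALUE"]
  },
  "libraries": [{ "jar": "dbfs:/shared/jarFileName.jar" }],
  "max_retries": 1
}
```

This is the configuration under which retries worked for us, at the cost of one cluster per task.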
09-11-2023 05:54 AM
Hi,
This should not actually be marked as solved. We are having the same problem: whenever a shared job cluster crashes for some reason (generally OOM), all tasks keep failing indefinitely with the error message described above. This is a pretty serious bug in multi-task Databricks Workflows, and it makes them basically unusable. If you have to perform a whole series of manual intervention steps whenever something goes wrong, what is the point?