Jobs with multi-tasking are failing to retry; how to fix this issue?

Ludo
New Contributor III

Hello,

This question is about our platform running `Databricks Runtime 11.3 LTS`.

I'm running a job with multiple tasks in parallel using a shared cluster.

Each task runs a dedicated Scala class from a JAR library attached as a dependency.

One of the tasks fails (a code-related error) and a retry is performed as expected. Unfortunately, the retry fails repeatedly with the following error message:

command--1:1: error: not found: value fakepackage
fakepackage.fakenamespace.fakeclass.main(Array("environmentCode=VALUE", "silverPath=s3a://silverpath", "goldPath=s3a://goldpath/"))
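For context, the failing command is the generated wrapper that invokes the task's entry point. A minimal sketch of what such an entry point looks like, reusing the placeholder names from the error message (the real package, class, and Spark logic are not shown in this thread):

```scala
package fakepackage.fakenamespace

object fakeclass {
  def main(args: Array[String]): Unit = {
    // Each argument arrives as "key=value", matching the invocation above.
    val params = args.map { arg =>
      val Array(key, value) = arg.split("=", 2)
      key -> value
    }.toMap
    // The real job would run Spark transformations from silverPath to goldPath here.
    println(s"environmentCode = ${params.getOrElse("environmentCode", "<missing>")}")
  }
}
```

The "not found: value fakepackage" error means the driver compiling this wrapper no longer sees the JAR on its classpath, which matches the later observation in this thread that the retry runs on a restarted driver without the library re-attached.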

It does not fail when the job starts; the issue only happens on a retry.

Previously, I was running these tasks independently (one task per job) with task-level clusters, and retries worked fine.

Is there an issue in Databricks with attaching the JAR library to a shared cluster? (It wasn't an issue before because a retry got a brand-new cluster?)

Thank you for your help.

Attached is the JSON to create the job with the Jobs API 2.1.
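The attachment itself isn't reproduced here, but a multi-task job sharing one job cluster, as described above, might look roughly like this in Jobs API 2.1 format (job name, cluster sizing, and task key are illustrative, not the original attachment):

```json
{
  "name": "multi-task-jar-job",
  "job_clusters": [
    {
      "job_cluster_key": "shared_cluster",
      "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "tasks": [
    {
      "task_key": "silver_to_gold",
      "job_cluster_key": "shared_cluster",
      "spark_jar_task": {
        "main_class_name": "fakepackage.fakenamespace.fakeclass",
        "parameters": [
          "environmentCode=VALUE",
          "silverPath=s3a://silverpath",
          "goldPath=s3a://goldpath/"
        ]
      },
      "libraries": [{ "jar": "dbfs:/shared/jarFileName.jar" }],
      "max_retries": 1,
      "min_retry_interval_millis": 60000
    }
  ]
}
```

Every task that references the same job_cluster_key runs on the one shared cluster, so a driver restart on that cluster affects all of them.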


7 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

Try the 10.3/10.4 LTS runtime and let us know if it is still failing.

Ludo
New Contributor III

Hello,

Unfortunately, we can't downgrade to 10.3/10.4 LTS (Spark 3.2.1), as we are using some features from the latest Spark 3.3.0.

We upgraded to 12.0 and we are currently monitoring; we will let you know if it gets better.

Side note: we observed that the first job crash is due to an OOM on the driver. It looks like the retry restarts the driver but does not manage to re-attach the library.

Ludo
New Contributor III

@Aviral Bhardwaj Quick update: the issue still happens on 12.0 😞

Attached are the driver OOM crash logs from the standard error output.

Aviral-Bhardwaj
Esteemed Contributor III

I've never seen this type of error.

Can you check with help@databricks.com? They will help you.

Hubert-Dudek
Esteemed Contributor III
(Accepted solution)

The retry fails because the JAR library is unresponsive. You need to kill it on the cluster, or restart the cluster after this happens.
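If the restart has to be scripted rather than done from the UI, the standard Clusters API can do it; a sketch of the call, using the cluster ID from the error message later in this thread:

```
POST /api/2.0/clusters/restart
{ "cluster_id": "0113-010003-id" }
```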

Ludo
New Contributor III

Thank you folks; this is useful.

We got an updated error message this morning, again on a task retry with a shared cluster (error below).

You're saying that a JAR library can become unresponsive (it's not a process, though?) and we would have to kill it manually? I'd expect the automated retry to handle that... Looks like a Databricks bug @Hubert Dudek

Your explanation confirms what we observed: it works properly without a shared cluster (one dedicated cluster per task), since a task-level cluster is fully restarted on retry.

Nevertheless, we would like to keep using shared clusters for multiple tasks; having one cluster per task does not make sense in our use case. (A comparison sketch of the per-task setup follows the error message below.)

Run result unavailable: job failed with error message: Library installation failed for library due to user error for jar: "dbfs:/shared/jarFileName.jar". Error messages: Library installation failed after PENDING for 10 minutes since cluster entered RUNNING state. Error Code: SPARK_CONTEXT_MISMATCH_FAILURE. Cannot get library installation state for cluster [0113-010003-id]. This can occur if the driver was recently restarted or terminated.
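For comparison with the shared-cluster spec earlier in the thread, the per-task cluster setup where retries worked replaces job_cluster_key with a task-level new_cluster, so a retry provisions a fresh driver and reinstalls the JAR. A sketch with illustrative values:

```json
{
  "task_key": "silver_to_gold",
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2
  },
  "spark_jar_task": {
    "main_class_name": "fakepackage.fakenamespace.fakeclass"
  },
  "libraries": [{ "jar": "dbfs:/shared/jarFileName.jar" }]
}
```

The trade-off noted above still stands: one cluster per task avoids the stuck-library state, but defeats the purpose of sharing a cluster across tasks.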

YoshiCoppens61
New Contributor II

Hi,

This actually should not be marked as solved. We are having the same problem: whenever a shared job cluster crashes for some reason (generally OOM), all tasks keep failing indefinitely with the error message described above. This is a pretty grave bug in multi-task Databricks Workflows that makes them basically unusable; if you have to perform a series of manual intervention steps whenever something goes wrong, what is the point?
