Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Error when running job in databricks

Benji
New Contributor II

Hello, I am very new to Databricks and MLflow. I'm facing a problem when running a job: it usually fails and retries itself, which increases the running time from a normal 6 hrs to 12-18 hrs.

[Screenshot: failed job run with automatic retries]

From the error log, the failure traces back to this point:

    # df_master_scored = df_master_scored.join(df_master, ["du_spine_primary_key"], how="left")
    df_master_scored.write.format("delta").mode("overwrite").saveAsTable(
        delta_table_schema + ".l5_du_scored_" + control_group
    )

Furthermore, the error usually looks like this:

Py4JJavaError: An error occurred while calling o36819.saveAsTable.
: org.apache.spark.SparkException: Job aborted.

Then it shows the cause:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 349.0 failed 4 times, most recent failure: Lost task 6.3 in stage 349.0 (TID 128171, 10.0.2.18, executor 22): org.apache.spark.api.python.PythonException: 'mlflow.exceptions.MlflowException: API request to https://southeastasia.azuredatabricks.net/api/2.0/mlflow/runs/search failed with exception HTTPSConnectionPool(host='southeastasia.azuredatabricks.net', port=443): Max retries exceeded with url: /api/2.0/mlflow/runs/search (Caused by ResponseError('too many 429 error responses'))'.
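The "too many 429 error responses" part of that trace means the workspace's MLflow REST endpoint is rate-limiting the requests, and the fact that it happens inside a Spark task suggests MLflow API calls (such as runs/search) are being made from every executor in parallel. The usual fix is to move such calls to the driver (e.g., resolve the run/model once before scoring, or use mlflow.pyfunc.spark_udf); as a stopgap, a driver-side wrapper with exponential backoff can ride out transient 429s. A minimal sketch, assuming a driver-side call site (the wrapper and its names are illustrative, not part of MLflow):

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0):
    """Retry fn() with exponential backoff plus jitter.

    Intended as a wrapper around a driver-side MLflow REST call
    (e.g. a lambda around mlflow.search_runs) that intermittently
    fails with HTTP 429 (rate limited).
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the original error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids retry stampedes
```

The backoff only helps if the call volume itself is reasonable; if every task still hits the API, the endpoint will keep returning 429s no matter how patiently each caller retries.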

Sometimes the cause changes to this (it only appeared in the latest job run):

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 366.0 failed 4 times, most recent failure: Lost task 13.3 in stage 366.0 (TID 128315, 10.0.2.7, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 153563 ms
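The heartbeat timeout is a different symptom: the executor stopped responding to the driver, which often points to memory pressure or long GC pauses during the large write. One common mitigation is to raise the heartbeat interval and network timeout in the cluster's Spark config (the values below are illustrative, not prescriptions; spark.network.timeout must stay larger than spark.executor.heartbeatInterval):

```
spark.executor.heartbeatInterval 60s
spark.network.timeout 800s
```

Raising timeouts only buys tolerance; if executors are genuinely running out of memory, resizing the cluster or repartitioning the data is the more durable fix.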

I don't know how to solve this issue. It seems to be related to MLflow. In any case, it adds a lot of cost.

Any suggestion for solving this issue?

5 REPLIES

-werners-
Esteemed Contributor III

Can you try with .option("overwriteSchema", "true")?

Benji
New Contributor II

Okay, I have added it already. Let's see the result tonight. 😀

Benji
New Contributor II

I just checked the job that ran last night. It doesn't seem to help; I still face the same error, and the job retries itself automatically again.

[Screenshot: job retry history]

-werners-
Esteemed Contributor III

I think you will have to debug your notebook to see where the issue actually arises.

The error pops up when writing the data because that is an action (Spark code is only executed when an action is triggered).

But the cause of the error seems to be somewhere upstream.

So try a .show() or display(df) cell by cell to see where the error actually occurs.
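The lazy-evaluation point can be illustrated outside Spark: like a Python generator, a chain of transformations only records work, and nothing runs until something consumes it, so the exception surfaces at the consuming step rather than where the bug lives. A toy sketch (plain Python, as an analogy only):

```python
def transform(rows):
    # A "transformation": nothing executes yet, just like Spark's lazy DataFrame ops.
    for r in rows:
        yield 10 / r  # the bug lurks here (fails when r == 0)

pipeline = transform([2, 1, 0])  # no error raised at this point

# Only the "action" (consuming the pipeline) triggers execution, and with it the error:
try:
    result = list(pipeline)
except ZeroDivisionError:
    result = "error surfaced at the action"
```

This is why forcing an action (a .show() or display) after each cell helps: it pins the failure to the transformation that actually caused it.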

Vidula
Honored Contributor

Hey there @Tanawat Benchasirirot

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
