Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Error when running job in databricks

Benji
New Contributor II

Hello, I am very new to Databricks and MLflow. I have a problem running a job. When the job runs, it usually fails and retries itself, which increases the running time from the normal 6 hrs to 12-18 hrs.

[Screenshot: failed job run]

The error log shows that the error comes from this point.

    # df_master_scored = df_master_scored.join(df_master, ["du_spine_primary_key"], how="left")
    df_master_scored.write.format("delta").mode("overwrite").saveAsTable(
        delta_table_schema + ".l5_du_scored_" + control_group
    )

Furthermore, the error I usually see looks like this:

Py4JJavaError: An error occurred while calling o36819.saveAsTable.
: org.apache.spark.SparkException: Job aborted.

Then it shows this cause:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 349.0 failed 4 times, most recent failure: Lost task 6.3 in stage 349.0 (TID 128171, 10.0.2.18, executor 22): org.apache.spark.api.python.PythonException: 'mlflow.exceptions.MlflowException: API request to https://southeastasia.azuredatabricks.net/api/2.0/mlflow/runs/search failed with exception HTTPSConnectionPool(host='southeastasia.azuredatabricks.net', port=443): Max retries exceeded with url: /api/2.0/mlflow/runs/search (Caused by ResponseError('too many 429 error responses'))'.

Sometimes the cause is different, like this (but this only appeared in the latest job run):

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 366.0 failed 4 times, most recent failure: Lost task 13.3 in stage 366.0 (TID 128315, 10.0.2.7, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 153563 ms

I don't know how to solve this issue. It may be related to MLflow. In any case, it increases the cost a lot.

Any suggestions for solving this issue?

5 REPLIES

-werners-
Esteemed Contributor III

Can you try with .option("overwriteSchema", "true")?
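
For reference, a minimal sketch of where that option would go in the write from the question. The wrapper function `write_scored_table` is a hypothetical name used only for illustration; the DataFrame, schema variable, and table name pattern are taken from the original snippet. This only changes how the Delta table schema is handled on overwrite, so it may not address the MLflow 429 error itself.

```python
def write_scored_table(df_master_scored, delta_table_schema, control_group):
    """Overwrite the scored Delta table, allowing its schema to be replaced too.

    `write_scored_table` is a hypothetical helper; the arguments mirror the
    variables used in the original notebook snippet.
    """
    table_name = delta_table_schema + ".l5_du_scored_" + control_group
    (
        df_master_scored.write.format("delta")
        .mode("overwrite")
        # Delta Lake option: let the overwrite replace the table schema as well.
        .option("overwriteSchema", "true")
        .saveAsTable(table_name)
    )
    return table_name
```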

Benji
New Contributor II

Okay, I have added it already. Let's see the result tonight. 😀

Benji
New Contributor II

I just checked the job that ran last night. It doesn't seem to help. I still face the same error, and the job automatically retries itself again.

[Screenshot: job retry]

-werners-
Esteemed Contributor III

I think you will have to debug your notebook to see where the issue actually arises.

The error pops up when writing the data because that is an action (and Spark code is only executed at an action).

But the cause of the error seems to be somewhere upstream.

So try a .show or display(df) cell by cell to see where you get an error.
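
A rough sketch of that cell-by-cell check, in case it helps. The helper `first_failing_step` and the step names are hypothetical, not part of Spark or of the original notebook. Each step is a callable that forces an action on one intermediate DataFrame, and the helper reports the first one that raises.

```python
def first_failing_step(steps):
    """Run (name, action) pairs in order and report the first failure.

    Returns (name, exception) for the first action that raises,
    or None if every action completes. Hypothetical debugging helper.
    """
    for name, action in steps:
        try:
            action()  # forces evaluation of everything upstream of this step
        except Exception as exc:
            return name, exc
    return None


# Example use in a notebook (DataFrame names assumed from the question):
# failure = first_failing_step([
#     ("df_master", lambda: df_master.show()),
#     ("df_master_scored", lambda: df_master_scored.show()),
# ])
# print(failure)
```

Keep in mind that .show() only computes enough rows to display, so a failure that occurs in later partitions may not surface until the full write runs.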

Vidula
Honored Contributor

Hey there @Tanawat Benchasirirot

Hope all is well! Just wanted to check in: were you able to resolve your issue, and would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
