Hello, I am very new to Databricks and MLflow. I am facing a problem with a scheduled job: when it runs, it often fails and retries itself, which increases the running time from the usual 6 hours to 12-18 hours.
From the error log, the failure occurs at this point:
# df_master_scored = df_master_scored.join(df_master, ["du_spine_primary_key"], how="left")
df_master_scored.write.format("delta").mode("overwrite").saveAsTable(
    delta_table_schema + ".l5_du_scored_" + control_group
)
The error usually looks like this:
Py4JJavaError: An error occurred while calling o36819.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
The reported cause is:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 349.0 failed 4 times, most recent failure: Lost task 6.3 in stage 349.0 (TID 128171, 10.0.2.18, executor 22): org.apache.spark.api.python.PythonException: 'mlflow.exceptions.MlflowException: API request to https://southeastasia.azuredatabricks.net/api/2.0/mlflow/runs/search failed with exception HTTPSConnectionPool(host='southeastasia.azuredatabricks.net', port=443): Max retries exceeded with url: /api/2.0/mlflow/runs/search (Caused by ResponseError('too many 429 error responses'))'.
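From what I understand, the repeated 429 ("too many requests") responses mean the MLflow `runs/search` REST endpoint is rate-limiting the job, presumably because every executor task is calling it. One generic client-side mitigation I have read about is wrapping such calls with exponential backoff; here is a minimal, illustrative sketch in plain Python (the function name and the `RuntimeError` stand-in are placeholders, not part of MLflow's API, and this is not my actual job code):

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff -- a common client-side
    mitigation when a REST endpoint answers with HTTP 429."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for a 429 / rate-limit error
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

I am not sure whether this is the right place to apply it, or whether the real fix is to avoid calling the MLflow API from inside executor tasks at all.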
Sometimes the cause changes to the following (this appeared only in the latest run):
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 366.0 failed 4 times, most recent failure: Lost task 13.3 in stage 366.0 (TID 128315, 10.0.2.7, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 153563 ms
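For the heartbeat timeout, I have seen suggestions to raise the executor heartbeat interval and the network timeout in the cluster's Spark configuration; the values below are illustrative, not tuned for this job (the heartbeat interval must stay well below the network timeout):

```
spark.executor.heartbeatInterval 60s
spark.network.timeout 600s
```

I do not know whether this addresses the root cause or only hides executor overload.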
I don't know how to solve this issue; it seems to be related to MLflow, and it adds a lot of cost.
Any suggestions for solving this issue?