Hello, I am very new to Databricks and MLflow. I am facing a problem with a scheduled job: when it runs, it often fails and retries itself, which increases the running time from the usual 6 hours to 12-18 hours.
From the error log, the failure occurs at this point:
# df_master_scored = df_master_scored.join(df_master, ["du_spine_primary_key"], how="left")
df_master_scored.write.format("delta").mode("overwrite").saveAsTable(
    delta_table_schema + ".l5_du_scored_" + control_group
)
The error usually looks like this:
Py4JJavaError: An error occurred while calling o36819.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
The reported cause is:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 349.0 failed 4 times, most recent failure: Lost task 6.3 in stage 349.0 (TID 128171, 10.0.2.18, executor 22): org.apache.spark.api.python.PythonException: 'mlflow.exceptions.MlflowException: API request to https://southeastasia.azuredatabricks.net/api/2.0/mlflow/runs/search failed with exception HTTPSConnectionPool(host='southeastasia.azuredatabricks.net', port=443): Max retries exceeded with url: /api/2.0/mlflow/runs/search (Caused by ResponseError('too many 429 error responses'))'.
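From what I understand, the repeated 429 ("too many requests") responses mean the MLflow `runs/search` REST endpoint is rate-limiting the job, presumably because every executor task is calling it. One generic client-side mitigation I have read about is wrapping such calls with exponential backoff; here is a minimal, illustrative sketch in plain Python (the function name and the `RuntimeError` stand-in are placeholders, not part of MLflow's API, and this is not my actual job code):

```python
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff -- a common client-side
    mitigation when a REST endpoint answers with HTTP 429."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for a 429 / rate-limit error
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

I am not sure whether this is the right place to apply it, or whether the real fix is to avoid calling the MLflow API from inside executor tasks at all.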
Sometimes the cause changes to the following (this appeared only in the latest run):
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 366.0 failed 4 times, most recent failure: Lost task 13.3 in stage 366.0 (TID 128315, 10.0.2.7, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 153563 ms
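For the heartbeat timeout, I have seen suggestions to raise the executor heartbeat interval and the network timeout in the cluster's Spark configuration; the values below are illustrative, not tuned for this job (the heartbeat interval must stay well below the network timeout):

```
spark.executor.heartbeatInterval 60s
spark.network.timeout 600s
```

I do not know whether this addresses the root cause or only hides executor overload.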
I don't know how to solve this issue; it seems to be related to MLflow, and it adds a lot of cost.
Any suggestions for solving this issue?