07-25-2022 11:47 PM
Hello, I am very new to Databricks and MLflow, and I am facing a problem with a running job. When the job runs, it usually fails and retries itself, which increases the running time from the normal 6 hrs to 12-18 hrs.
From the error log, it appears the error comes from this point:
# df_master_scored = df_master_scored.join(df_master, ["du_spine_primary_key"], how="left")
# the write below is where the job aborts (saveAsTable triggers actual execution)
df_master_scored.write.format("delta").mode("overwrite").saveAsTable(
    delta_table_schema + ".l5_du_scored_" + control_group
)
Furthermore, the error I see usually looks like this:
Py4JJavaError: An error occurred while calling o36819.saveAsTable.
: org.apache.spark.SparkException: Job aborted.
Then it shows the cause:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 349.0 failed 4 times, most recent failure: Lost task 6.3 in stage 349.0 (TID 128171, 10.0.2.18, executor 22): org.apache.spark.api.python.PythonException: 'mlflow.exceptions.MlflowException: API request to https://southeastasia.azuredatabricks.net/api/2.0/mlflow/runs/search failed with exception HTTPSConnectionPool(host='southeastasia.azuredatabricks.net', port=443): Max retries exceeded with url: /api/2.0/mlflow/runs/search (Caused by ResponseError('too many 429 error responses'))'.
Sometimes the cause changes to this (but it only appeared in the latest job run):
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 366.0 failed 4 times, most recent failure: Lost task 13.3 in stage 366.0 (TID 128315, 10.0.2.7, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 153563 ms
I don't know how to solve this issue; it seems to be related to an MLflow problem. In any case, it adds a lot of cost.
Any suggestions for solving this issue?
07-26-2022 12:06 AM
Can you try with .option("overwriteSchema", "true")?
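Something like this (a sketch based on your snippet; the overwriteSchema option tells Delta to replace the table schema when overwriting):

df_master_scored.write.format("delta").mode("overwrite").option(
    "overwriteSchema", "true"
).saveAsTable(delta_table_schema + ".l5_du_scored_" + control_group)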
07-26-2022 12:42 AM
Okay, I have added it already. Let's see the result tonight. 😀
07-27-2022 04:10 AM
I think you will have to debug your notebook to see where the issue actually arises.
The error pops up when writing the data because that is an action (and Spark code is only executed at an action).
But the cause of the error seems to be somewhere upstream.
So try a .show() or display(df) cell by cell to see where you get an error.
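For example (hypothetical debugging cells; df_master and df_master_scored are from your snippet, and any action will do to force evaluation):

display(df_master)         # forces evaluation of the upstream pipeline
display(df_master_scored)  # if this fails too, the problem is in the scoring step, not the write

Also, the 429 ("too many requests") responses from /api/2.0/mlflow/runs/search suggest the scoring code may be calling the MLflow REST API (e.g. mlflow.search_runs or a model load) inside a UDF, so every executor task hits the endpoint. If that is the case, a common pattern is to resolve the model once on the driver and distribute it as a Spark UDF, roughly like this (the model URI and feature_columns are placeholders):

import mlflow.pyfunc

# load the model once on the driver; Spark ships it to the executors,
# so the tasks no longer call the MLflow REST API themselves
score_udf = mlflow.pyfunc.spark_udf(spark, model_uri="runs:/<run_id>/model")
df_master_scored = df_master.withColumn("score", score_udf(*feature_columns))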
09-05-2022 06:25 AM
Hey there @Tanawat Benchasirirot
Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!