MlflowException: "Connection broken: ConnectionResetError(104, 'Connection reset by peer')"

cl2
New Contributor II

Hello,

I have a workflow running which from time to time crashes with the error:

MlflowException: The following failures occurred while downloading one or more artifacts from models:/incubator-forecast-charging-demand-power-and-io-dk2/Production: {'python_model.pkl': 'MlflowException(\'("Connection broken: ConnectionResetError(104, \\\'Connection reset by peer\\\')", ConnectionResetError(104, \\\'Connection reset by peer\\\'))\')'}

I don't really know how to interpret this exception. Furthermore, it does not happen every day, and I have yet to find any reason why it crashes.

Any suggestions as to what might be wrong?

Ayushi_Suthar
Databricks Employee

Hi @cl2, thanks for bringing up your concerns; always happy to help 😁

Looking at the details, this appears to be an HTTP connection error that occurred while downloading artifacts. It typically shouldn't happen, but it can occur intermittently when a transient network issue causes the artifact download to fail.

As an immediate workaround, we recommend wrapping the model download logic in a retry. A couple of retries with 1 second of sleep between attempts (time.sleep(1)) should be enough to clear the issue.
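
The retry suggested above could look roughly like the sketch below. It is a minimal illustration, not an official Databricks or MLflow recipe: the helper name call_with_retries and its parameters are hypothetical, and the commented usage assumes the model is loaded with mlflow.pyfunc.load_model, which the thread does not confirm.

```python
import time


def call_with_retries(func, max_attempts=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call func(); if it raises one of retry_on, sleep and try again.

    Gives up and re-raises the last error after max_attempts total calls.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay_seconds)  # short pause before the next attempt


# Hypothetical usage (assumes the model is loaded via mlflow.pyfunc.load_model):
#   import mlflow
#   from mlflow.exceptions import MlflowException
#   model = call_with_retries(
#       lambda: mlflow.pyfunc.load_model("models:/<model-name>/Production"),
#       retry_on=(MlflowException,),
#   )
```

Restricting retry_on to the exception type seen in the traceback keeps unrelated bugs from being silently retried.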

Leave a like if this helps, followups are appreciated.

Ayushi

cl2
New Contributor II

Hi @Ayushi_Suthar 

It happens 1 or 2 times a week, so quite frequently.

Can you elaborate a bit more on how I should integrate the retry logic? If there are some examples or documentation somewhere, it would be much appreciated 🙂

Ayushi_Suthar
Databricks Employee

Hi @cl2, thank you for writing back!

You can optionally configure a retry policy for your task within a Job. The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run. 
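
For illustration, the task-level retry policy maps to a few fields in the Jobs API task settings. The sketch below shows them as a Python dictionary; the task key and the chosen values are hypothetical. Deployment tools that pass task definitions through to the Jobs API would likely accept the same field names, but that is an assumption to check against the tool's own documentation.

```python
import json

# Hypothetical task definition; only the three retry fields matter here.
task_settings = {
    "task_key": "load_model_and_forecast",  # hypothetical task name
    "max_retries": 2,                        # retry a failed run up to 2 times
    "min_retry_interval_millis": 60_000,     # minimum interval before a retry run (ms)
    "retry_on_timeout": False,               # do not retry runs that timed out
}

print(json.dumps(task_settings, indent=2))
```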

You can go through these documents for more details: 

https://docs.databricks.com/en/workflows/jobs/settings.html#configure-a-retry-policy-for-a-task

https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html#:~:text=To%20optionally%20configu....

Please let me know if this helps, and leave a like if it does. Follow-ups are appreciated.
Ayushi

cl2
New Contributor II

Hi @Ayushi_Suthar,

I am using DBX to deploy the workflows. Do you have any documentation with details on how to implement this with DBX?