Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

MlflowException: "Connection broken: ConnectionResetError(104, 'Connection reset by peer')"

cl2
New Contributor II

Hello,

I have a workflow that crashes from time to time with the following error:

MlflowException: The following failures occurred while downloading one or more artifacts from models:/incubator-forecast-charging-demand-power-and-io-dk2/Production: {'python_model.pkl': 'MlflowException(\'("Connection broken: ConnectionResetError(104, \\\'Connection reset by peer\\\')", ConnectionResetError(104, \\\'Connection reset by peer\\\'))\')'}

I don't really know how to interpret this exception. Furthermore, it does not happen every day, and I have yet to find any reason why it crashes.

Any suggestions on what might be wrong?

4 REPLIES

Ayushi_Suthar
Databricks Employee

Hi @cl2, thanks for bringing up your concerns; always happy to help 😁

From the details, it appears there was an HTTP connection error while downloading the model artifacts. This typically shouldn’t happen, but it can occur intermittently when a transient network issue causes the artifact download to fail.

As an immediate workaround, we recommend wrapping the model download logic in a retry. A couple of retries with 1 second of sleep between them (time.sleep(1)) should be enough to clear the issue.
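A minimal sketch of such a retry wrapper is below. It assumes the model is loaded with mlflow.pyfunc.load_model, which the thread does not confirm; the helper names call_with_retries and load_model_with_retries are hypothetical, and the attempt count and delay are illustrative defaults, not recommended values.

```python
import time


def call_with_retries(fn, max_attempts=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call fn() up to max_attempts times, sleeping between failed attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on as err:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error unchanged
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay_seconds}s")
            time.sleep(delay_seconds)


def load_model_with_retries(model_uri):
    """Hypothetical wrapper: assumes the model is loaded via mlflow.pyfunc."""
    # Imported here so call_with_retries itself has no MLflow dependency.
    import mlflow.pyfunc
    from mlflow.exceptions import MlflowException

    return call_with_retries(
        lambda: mlflow.pyfunc.load_model(model_uri),
        retry_on=(MlflowException,),
    )
```

It would be called in place of the direct load, for example load_model_with_retries("models:/<model-name>/Production"). Retrying only on MlflowException keeps unrelated bugs from being silently retried; if the failure can also surface as a raw connection error in your code path, the retry_on tuple would need to include that exception type as well.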

Leave a like if this helps, followups are appreciated.
Kudos

Ayushi

cl2
New Contributor II

Hi @Ayushi_Suthar 

It happens 1 or 2 times a week, so quite frequently.

Can you elaborate a bit more on how I should integrate the retry logic? If there are any examples or documentation somewhere, it would be much appreciated 🙂

Ayushi_Suthar
Databricks Employee

Hi @cl2, thank you for writing back!

You can optionally configure a retry policy for your task within a Job. The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run. 

You can go through these documents for more details: 

https://docs.databricks.com/en/workflows/jobs/settings.html#configure-a-retry-policy-for-a-task

https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html#:~:text=To%20optionally%20configu....
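For job definitions kept as code rather than edited in the UI, the retry policy is expressed as fields on the task settings. The sketch below assumes the Jobs API task fields max_retries, min_retry_interval_millis, and retry_on_timeout; the task_key, the helper name with_retry_policy, and the chosen values are hypothetical and only illustrate the shape.

```python
import copy
import json


def with_retry_policy(task, max_retries=2, min_retry_interval_millis=60_000,
                      retry_on_timeout=False):
    """Return a copy of a task settings dict with retry fields added."""
    updated = copy.deepcopy(task)
    updated["max_retries"] = max_retries
    # Minimum wait between the start of the failed run and the retry run.
    updated["min_retry_interval_millis"] = min_retry_interval_millis
    updated["retry_on_timeout"] = retry_on_timeout
    return updated


# Hypothetical task; a real definition also needs its compute and code settings.
task = {"task_key": "forecast"}
task_with_retries = with_retry_policy(task)
print(json.dumps(task_with_retries, indent=2))
```

Note that a task-level retry reruns the whole task, whereas a retry around only the model download repeats just the failing step.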

Please let me know if this helps, and leave a like if it does. Follow-ups are appreciated.
Kudos
Ayushi

cl2
New Contributor II

Hi @Ayushi_Suthar,

I am using DBX to deploy the workflows. Do you have any documentation with details on how to implement this with DBX?
