MlflowException: "Connection broken: ConnectionResetError(104, 'Connection reset by peer')"

cl2
New Contributor II

Hello,

I have a workflow that occasionally crashes with the following error:

MlflowException: The following failures occurred while downloading one or more artifacts from models:/incubator-forecast-charging-demand-power-and-io-dk2/Production: {'python_model.pkl': 'MlflowException(\'("Connection broken: ConnectionResetError(104, \\\'Connection reset by peer\\\')", ConnectionResetError(104, \\\'Connection reset by peer\\\'))\')'}

I don't really know how to interpret this exception. Furthermore, it does not happen every day, and I have yet to find a reason why it crashes.

Any suggestions as to what might be wrong?


Ayushi_Suthar
Honored Contributor

Hi @cl2, thanks for bringing up your concerns; always happy to help 😁

Going through the details, it appears there was an HTTP connection error while downloading the model artifacts. This typically shouldn't happen, but it can occur intermittently when a transient network issue causes the artifact download to fail.

As an immediate workaround, we recommend wrapping the model download logic in a retry. A couple of retries with a one-second sleep between them (time.sleep(1)) should be enough to clear the issue.
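A minimal sketch of such a retry wrapper is shown below. The helper name call_with_retries and its parameters are hypothetical (they are not part of MLflow or Databricks), and the usage comment assumes the model is loaded with mlflow.pyfunc.load_model; adapt it to however your workflow actually loads the model.

```python
import time


def call_with_retries(fn, max_attempts=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call fn() and retry when it raises one of the retry_on exception types.

    Hypothetical helper for illustration; not part of MLflow or Databricks.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            time.sleep(delay_seconds)  # short pause before the next attempt


# Usage sketch (assumes mlflow is installed; the model URI is a placeholder):
#
#   import mlflow
#   from mlflow.exceptions import MlflowException
#
#   model = call_with_retries(
#       lambda: mlflow.pyfunc.load_model("models:/<model-name>/Production"),
#       retry_on=(MlflowException,),
#   )
```

Restricting retry_on to MlflowException keeps unrelated errors (for example, a typo in the model name raising a different exception type) from being retried silently. If the resets keep recurring, a longer or increasing delay may be worth trying.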

Leave a like if this helps; follow-ups are appreciated.
Kudos

Ayushi

cl2
New Contributor II

Hi @Ayushi_Suthar 

It happens once or twice a week, so quite frequently.

Can you elaborate a bit more on how I should integrate the retry logic? If there are examples or documentation somewhere, it would be much appreciated 🙂

Ayushi_Suthar
Honored Contributor

Hi @cl2, thank you for writing back!

You can optionally configure a retry policy for your task within a Job. The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run. 
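As a rough illustration, the task-level retry settings might look like the fragment below, written as a Python dict that mirrors a Jobs API task definition. The field names max_retries, min_retry_interval_millis and retry_on_timeout come from the Databricks Jobs API; the task key and the values are assumptions chosen for the example. Deployment tools that pass task definitions through to the Jobs API may accept the same keys, but that should be checked against the tool's own documentation.

```python
import json

# Task-level retry settings, using field names from the Databricks Jobs API.
# The task key and the chosen values are illustrative assumptions.
task_settings = {
    "task_key": "forecast",              # hypothetical task name
    "max_retries": 2,                    # retry a failed run up to two times
    "min_retry_interval_millis": 60000,  # at least 60 s from the start of the failed run to the retry
    "retry_on_timeout": False,           # do not retry runs that failed by timing out
}

# Print the fragment as JSON, e.g. to paste into a job definition.
print(json.dumps(task_settings, indent=2))
```

A job-level retry reruns the whole task, so it is a coarser tool than retrying only the download inside the code; it also covers failures that happen elsewhere in the task.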

You can go through these documents for more details: 

https://docs.databricks.com/en/workflows/jobs/settings.html#configure-a-retry-policy-for-a-task

https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html#:~:text=To%20optionally%20configu....

Please let me know if this helps, and leave a like if it does; follow-ups are appreciated.
Kudos
Ayushi

cl2
New Contributor II

Hi @Ayushi_Suthar,

I am using DBX to deploy the workflows. Do you have any documentation on how to implement this with DBX?

Kaniz
Community Manager

Hey there! Thanks a bunch for being part of our awesome community! 🎉 

We love having you around and appreciate all your questions. Take a moment to check out the responses; you'll find some great info. Your input is valuable, so pick the best solution for you. And remember, if you ever need more help, we're here for you!

Keep being awesome! 😊🚀

 
