MlflowException: "Connection broken: ConnectionResetError(104, 'Connection reset by peer')"
02-05-2024 11:39 PM
Hello,
I have a workflow running which from time to time crashes with the error:
MlflowException: The following failures occurred while downloading one or more artifacts from models:/incubator-forecast-charging-demand-power-and-io-dk2/Production: {'python_model.pkl': 'MlflowException(\'("Connection broken: ConnectionResetError(104, \\\'Connection reset by peer\\\')", ConnectionResetError(104, \\\'Connection reset by peer\\\'))\')'}
I don't really know how to interpret this exception. Furthermore, it does not happen every day, and I have yet to find a reason why it crashes.
Any suggestion what might be wrong?
02-07-2024 09:57 PM
Hi @cl2, thanks for bringing up your concerns; always happy to help 😁
Going through the details, it appears there was an HTTP connection error while downloading the model artifacts. This typically shouldn't happen, but it can occur intermittently when a transient network issue causes the artifact download to fail.
As an immediate workaround, we recommend wrapping the model download logic in a retry. A couple of retries with a one-second sleep between them (time.sleep(1)) should be enough to clear the issue.
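A minimal sketch of such a retry wrapper is shown here. The helper name `retry` and its parameters are made up for illustration, and the commented-out usage assumes the model is loaded with `mlflow.pyfunc.load_model`, which may not match how your job actually loads it.

```python
import time


def retry(fn, attempts=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call fn() up to `attempts` times, sleeping between failed attempts.

    `retry` is a hypothetical helper, not part of MLflow.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as err:
            last_error = err
            # Sleep only if another attempt is still coming.
            if attempt < attempts:
                time.sleep(delay_seconds)
    # All attempts failed: surface the last error to the caller.
    raise last_error


# Usage (assumption: a pyfunc model loaded from the registry):
#
# import mlflow
# from mlflow.exceptions import MlflowException
#
# model = retry(
#     lambda: mlflow.pyfunc.load_model("models:/<model-name>/Production"),
#     attempts=3,
#     delay_seconds=1.0,
#     retry_on=(MlflowException,),
# )
```

Restricting `retry_on` to `MlflowException` keeps unrelated errors (for example a typo in the model URI handling code) from being retried silently.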
Leave a like if this helps, followups are appreciated.
Kudos
Ayushi
02-09-2024 02:34 AM
Hi @Ayushi_Suthar
It happens once or twice a week, so quite frequently.
Can you elaborate a bit more on how I should integrate the retry logic? If there are some examples or documentation somewhere, it would be much appreciated 🙂
02-10-2024 04:29 AM
Hi @cl2, thank you for writing back!
You can optionally configure a retry policy for your task within a Job. The retry interval is calculated in milliseconds between the start of the failed run and the subsequent retry run.
You can go through these documents for more details:
https://docs.databricks.com/en/workflows/jobs/settings.html#configure-a-retry-policy-for-a-task
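As a rough illustration, the retry policy corresponds to a few fields on the task definition in the Jobs API: `max_retries`, `min_retry_interval_millis`, and `retry_on_timeout`. The sketch below adds them to a task expressed as a Python dict. The helper `with_retry_policy`, the task key, and the chosen values are hypothetical and only meant to show the shape of the settings.

```python
import json


def with_retry_policy(task, max_retries=2, min_retry_interval_millis=60_000,
                      retry_on_timeout=False):
    """Return a copy of a task definition with retry settings added.

    Hypothetical helper; the three field names are Jobs API task fields.
    """
    updated = dict(task)  # shallow copy so the caller's dict is not mutated
    updated.update(
        {
            "max_retries": max_retries,
            "min_retry_interval_millis": min_retry_interval_millis,
            "retry_on_timeout": retry_on_timeout,
        }
    )
    return updated


# Hypothetical task key, for illustration only.
task = {"task_key": "load_model_and_forecast"}
print(json.dumps(with_retry_policy(task), indent=2))
```

A task-level retry reruns the whole task after a failure, so it is coarser than retrying only the model download inside the code.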
Please let me know if this helps, and leave a like if it does. Follow-ups are appreciated.
Kudos
Ayushi
02-15-2024 11:25 PM
Hi @Ayushi_Suthar,
I am using DBX to deploy the workflows. Do you have any documentation with details on how to implement this with DBX?