MLflow stopped working after a few successful runs
12-09-2022 08:27 AM
Both the mlflow.log_metrics() call and the web UI worked for the first few days, but at some point both started failing. The log gives no clue as to why. Is there a limit on the number of MLflow requests? It is quite annoying that the web UI stops displaying finished MLflow runs without any message explaining why.
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/tracking/fluent.py", line 1050, in get_experiment_by_name
return MlflowClient().get_experiment_by_name(name)
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/tracking/client.py", line 451, in get_experiment_by_name
return self._tracking_client.get_experiment_by_name(name)
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/tracking/_tracking_service/client.py", line 197, in get_experiment_by_name
return self.store.get_experiment_by_name(name)
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py", line 290, in get_experiment_by_name
response_proto = self._call_endpoint(GetExperimentByName, req_body)
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/store/tracking/rest_store.py", line 56, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 273, in call_endpoint
response = http_request(
File "/home/ec2-user/anaconda3/envs/abr/lib/python3.8/site-packages/mlflow/utils/rest_utils.py", line 184, in http_request
raise MlflowException("API request to %s failed with exception %s" % (url, e))
mlflow.exceptions.MlflowException: API request to https://community.cloud.databricks.com/api/2.0/mlflow/experiments/get-by-name failed with exception HTTPSConnectionPool(host='community.cloud.databricks.com', port=443): Max retries exceeded with url: /api/2.0/mlflow/experiments/get-by-name?experiment_name=%2Fabr-sagemaker-mid (Caused by ResponseError('too many 429 error responses'))
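The final line of the traceback points at HTTP 429 ("too many requests") responses, which servers typically use to signal rate limiting. One common client-side mitigation is to retry with exponential backoff. The sketch below is only an illustration under assumptions: `call_with_backoff` and `looks_rate_limited` are hypothetical helpers, not part of MLflow, and detecting the 429 by matching the exception message is a guess based on the text in the traceback above. It will not help if the endpoint has been disabled on the server side.

```python
import time


def looks_rate_limited(exc):
    # Assumption: a rate-limit failure is recognised by "429" in the
    # exception message, as in the traceback above.
    return "429" in str(exc)


def call_with_backoff(fn, should_retry, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise errors that are not retryable, or the last failure
            # once all attempts have been used.
            if not should_retry(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With MLflow installed, a call such as `call_with_backoff(lambda: mlflow.get_experiment_by_name("/abr-sagemaker-mid"), looks_rate_limited)` would wrap the lookup that fails in the traceback; the delays and attempt count here are arbitrary starting values.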
Labels: MlFlow, Successful Runs, Web ui
12-09-2022 08:42 AM
The MLflow Slack may be a good place to ask:
12-09-2022 09:00 AM
It seems Databricks has disabled showing existing runs. A small pop-up window flashes for half a second saying "This endpoint has temporarily been disabled. Please contact Databricks support". I almost missed it, as the window appears at random and is ephemeral.