Bug: MLflow connection fails after 2d
09-05-2023 07:49 AM
Hi there,
After exactly 2 days of training, the following error is raised by an API call to MLflow:
ValueError: Enum ErrorCode has no value defined for name '403'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
41 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 42 return trainer_fn(*args, **kwargs)
44 except _TunerExitException:
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:568, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
562 ckpt_path = self._checkpoint_connector._select_ckpt_path(
563 self.state.fn,
564 ckpt_path,
565 model_provided=True,
566 model_connected=self.lightning_module is not None,
567 )
--> 568 self._run(model, ckpt_path=ckpt_path)
570 assert self.state.stopped
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:973, in Trainer._run(self, model, ckpt_path)
970 # ----------------------------
971 # RUN THE TRAINER
972 # ----------------------------
--> 973 results = self._run_stage()
975 # ----------------------------
976 # POST-Training CLEAN UP
977 # ----------------------------
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1016, in Trainer._run_stage(self)
1015 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1016 self.fit_loop.run()
1017 return None
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:201, in _FitLoop.run(self)
200 self.on_advance_start()
--> 201 self.advance()
202 self.on_advance_end()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:354, in _FitLoop.advance(self)
353 with self.trainer.profiler.profile("run_training_epoch"):
--> 354 self.epoch_loop.run(self._data_fetcher)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:133, in _TrainingEpochLoop.run(self, data_fetcher)
132 try:
--> 133 self.advance(data_fetcher)
134 self.on_advance_end()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:206, in _TrainingEpochLoop.advance(self, data_fetcher)
204 else:
205 # hook
--> 206 call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
207 response = call._call_lightning_module_hook(trainer, "on_train_batch_start", batch, batch_idx)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:193, in _call_callback_hooks(trainer, hook_name, monitoring_callbacks, *args, **kwargs)
192 with trainer.profiler.profile(f"[Callback]{callback.state_key}.{hook_name}"):
--> 193 fn(trainer, trainer.lightning_module, *args, **kwargs)
195 if pl_module:
196 # restore current_fx when nested context
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/callbacks/lr_monitor.py:158, in LearningRateMonitor.on_train_batch_start(self, trainer, *args, **kwargs)
157 for logger in trainer.loggers:
--> 158 logger.log_metrics(latest_stat, step=trainer.fit_loop.epoch_loop._batches_that_stepped)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
31 if rank == 0:
---> 32 return fn(*args, **kwargs)
33 return None
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:272, in MLFlowLogger.log_metrics(self, metrics, step)
270 metrics_list.append(Metric(key=k, value=v, timestamp=timestamp_ms, step=step or 0))
--> 272 self.experiment.log_batch(run_id=self.run_id, metrics=metrics_list)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:965, in MlflowClient.log_batch(self, run_id, metrics, params, tags)
915 """
916 Log multiple metrics, params, and/or tags.
917
(...)
963 status: FINISHED
964 """
--> 965 self._tracking_client.log_batch(run_id, metrics, params, tags)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:394, in TrackingServiceClient.log_batch(self, run_id, metrics, params, tags)
393 for metrics_batch in chunk_list(metrics, chunk_size=MAX_METRICS_PER_BATCH):
--> 394 self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:321, in RestStore.log_batch(self, run_id, metrics, params, tags)
318 req_body = message_to_json(
319 LogBatch(metrics=metric_protos, params=param_protos, tags=tag_protos, run_id=run_id)
320 )
--> 321 self._call_endpoint(LogBatch, req_body)
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
278 response = http_request(
279 host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
280 )
--> 281 response = verify_rest_response(response, endpoint)
282 js_dict = json.loads(response.text)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
206 if _can_parse_as_json_object(response.text):
--> 207 raise RestException(json.loads(response.text))
208 else:
File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
98 message = "{}: {}".format(
99 error_code,
100 json["message"] if "message" in json else "Response: " + str(json),
101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
103 self.json = json
File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
72 pass # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
74 self._enum_type.name, name))
ValueError: Enum ErrorCode has no value defined for name '403'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
File <command-1886660812327855>:15
1 torch.set_float32_matmul_precision(hparams.float32_matmul_precision)
2 with (
3 train_converter.make_torch_dataloader(
4 batch_size=hparams.batch_size, num_epochs=1
(...)
13 # batch["tokens"].to('cpu')
14 # pl_module._forward_with_loss(batch,"debug")
---> 15 trainer.fit(pl_module, train_dl, val_dl)
16 clear_pl_module()
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/autologging_utils/safety.py:435, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
420 if (
421 active_session_failed
422 or autologging_is_disabled(autologging_integration)
(...)
429 # warning behavior during original function execution, since autologging is being
430 # skipped
431 with set_non_mlflow_warnings_behavior_for_current_thread(
432 disable_warnings=False,
433 reroute_warnings=False,
434 ):
--> 435 return original(*args, **kwargs)
437 # Whether or not the original / underlying function has been called during the
438 # execution of patched code
439 original_has_been_called = False
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:529, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
527 model = _maybe_unwrap_optimized(model)
528 self.strategy._lightning_module = model
--> 529 call._call_and_handle_interrupt(
530 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
531 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:65, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
63 trainer.strategy.on_exception(exception)
64 for logger in trainer.loggers:
---> 65 logger.finalize("failed")
66 trainer._teardown()
67 # teardown might access the stage so we reset it after
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
30 raise RuntimeError("The `rank_zero_only.rank` needs to be set before use")
31 if rank == 0:
---> 32 return fn(*args, **kwargs)
33 return None
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:289, in MLFlowLogger.finalize(self, status)
286 if self._checkpoint_callback:
287 self._scan_and_log_checkpoints(self._checkpoint_callback)
--> 289 if self.experiment.get_run(self.run_id):
290 self.experiment.set_terminated(self.run_id, status)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:150, in MlflowClient.get_run(self, run_id)
112 def get_run(self, run_id: str) -> Run:
113 """
114 Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
115 contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
(...)
148 status: FINISHED
149 """
--> 150 return self._tracking_client.get_run(run_id)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:72, in TrackingServiceClient.get_run(self, run_id)
58 """
59 Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
60 contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
(...)
69 raises an exception.
70 """
71 _validate_run_id(run_id)
---> 72 return self.store.get_run(run_id)
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:134, in RestStore.get_run(self, run_id)
126 """
127 Fetch the run from backend store
128
(...)
131 :return: A single Run object if it exists, otherwise raises an Exception
132 """
133 req_body = message_to_json(GetRun(run_uuid=run_id, run_id=run_id))
--> 134 response_proto = self._call_endpoint(GetRun, req_body)
135 return Run.from_proto(response_proto.run)
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
54 endpoint, method = _METHOD_TO_INFO[api]
55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
277 else:
278 response = http_request(
279 host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
280 )
--> 281 response = verify_rest_response(response, endpoint)
282 js_dict = json.loads(response.text)
283 parse_dict(js_dict=js_dict, message=response_proto)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
205 if response.status_code != 200:
206 if _can_parse_as_json_object(response.text):
--> 207 raise RestException(json.loads(response.text))
208 else:
209 base_msg = "API request to endpoint {} failed with error code {} != 200".format(
210 endpoint,
211 response.status_code,
212 )
File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
97 error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
98 message = "{}: {}".format(
99 error_code,
100 json["message"] if "message" in json else "Response: " + str(json),
101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
103 self.json = json
File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
71 except KeyError:
72 pass # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
74 self._enum_type.name, name))
ValueError: Enum ErrorCode has no value defined for name '403'
Config details:
12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)
05-28-2024 03:50 PM
Hello @Alex42!
The ValueError is masking an HTTP 403 (forbidden) response caused by an expired access token. As the traceback shows, the tracking server returned 403 as the error code, and RestException.__init__ then failed while looking up that name in MLflow's ErrorCode enum, so the original error was replaced by the ValueError. The Databricks access token that the MLflow Python client uses to communicate with the tracking server has a limited lifespan for security reasons, typically 48 hours. If a notebook or job runs longer than that, the token expires and MLflow calls fail with a 403 Invalid access token error.
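Because the real cause is hidden behind the ValueError, it can help to detect this case explicitly in a long-running job. Here is a minimal sketch, not part of MLflow: the helper name is_masked_403 is made up for illustration, and it only matches the message text shown in the traceback above.

```python
def is_masked_403(exc: BaseException) -> bool:
    """Return True if exc looks like the masked HTTP 403 from this thread.

    Hypothetical helper: it only matches the message text seen in the
    traceback, so it may need adjusting for other MLflow versions.
    """
    return isinstance(exc, ValueError) and "no value defined for name '403'" in str(exc)


# Example using the exact message from the traceback.
err = ValueError("Enum ErrorCode has no value defined for name '403'")
if is_masked_403(err):
    print("MLflow call was rejected with HTTP 403; the access token has probably expired.")
```

One way to use it would be to wrap trainer.fit(...) in a try/except block and log a hint about token expiry whenever the helper returns True.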
To resolve the issue, please follow these steps:
- Create a Personal Access Token (PAT) by following the instructions in this documentation: https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspa.... Keep the token's default lifetime of 90 days.
- Next, authenticate with the Databricks SDK at the beginning of your code, using the following commands:
import os

from databricks.sdk import WorkspaceClient

# Replace both placeholders with your own values.
os.environ["DATABRICKS_TOKEN"] = "PAT-you-generated-in-step-one"
os.environ["DATABRICKS_HOST"] = "https://<YOUR_DATABRICKS_WORKSPACE_URL>"

w = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)
Alternatively, you can store your PAT in a secret and perform this authentication in a more elegant and secure way, as described in steps 2 and 3 of this Knowledge Base article: https://kb.databricks.com/en_US/machine-learning/mlflow-invalid-access-token-error.
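As a rough sketch of the secret-based variant, assuming the PAT is already stored in a Databricks secret: the function name configure_databricks_auth, the get_token callback, and the scope and key names in the comment are all placeholders chosen for this example, not taken from the Knowledge Base article.

```python
import os
from typing import Callable


def configure_databricks_auth(host: str, get_token: Callable[[], str]) -> None:
    """Export host and token as environment variables for later API calls.

    get_token is a hypothetical callback; in a Databricks notebook it could
    wrap dbutils.secrets.get(scope=..., key=...).
    """
    token = get_token()
    if not token:
        raise ValueError("get_token() returned an empty token")
    os.environ["DATABRICKS_HOST"] = host
    os.environ["DATABRICKS_TOKEN"] = token


# In a Databricks notebook (scope and key names are placeholders):
# configure_databricks_auth(
#     "https://<YOUR_DATABRICKS_WORKSPACE_URL>",
#     lambda: dbutils.secrets.get(scope="my-scope", key="mlflow-pat"),
# )
```

Passing the token through a callback keeps the PAT out of the notebook source, and the function can be exercised outside Databricks with a dummy callback.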
Best Regards,
Jéssica Santos

