09-05-2023 07:49 AM
Hi there,
After exactly 2 days of training, an API call to MLflow raises the following error:
ValueError: Enum ErrorCode has no value defined for name '403'
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
41 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 42 return trainer_fn(*args, **kwargs)
44 except _TunerExitException:
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:568, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
562 ckpt_path = self._checkpoint_connector._select_ckpt_path(
563 self.state.fn,
564 ckpt_path,
565 model_provided=True,
566 model_connected=self.lightning_module is not None,
567 )
--> 568 self._run(model, ckpt_path=ckpt_path)
570 assert self.state.stopped
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:973, in Trainer._run(self, model, ckpt_path)
970 # ----------------------------
971 # RUN THE TRAINER
972 # ----------------------------
--> 973 results = self._run_stage()
975 # ----------------------------
976 # POST-Training CLEAN UP
977 # ----------------------------
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1016, in Trainer._run_stage(self)
1015 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1016 self.fit_loop.run()
1017 return None
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:201, in _FitLoop.run(self)
200 self.on_advance_start()
--> 201 self.advance()
202 self.on_advance_end()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:354, in _FitLoop.advance(self)
353 with self.trainer.profiler.profile("run_training_epoch"):
--> 354 self.epoch_loop.run(self._data_fetcher)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:133, in _TrainingEpochLoop.run(self, data_fetcher)
132 try:
--> 133 self.advance(data_fetcher)
134 self.on_advance_end()
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:206, in _TrainingEpochLoop.advance(self, data_fetcher)
204 else:
205 # hook
--> 206 call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
207 response = call._call_lightning_module_hook(trainer, "on_train_batch_start", batch, batch_idx)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:193, in _call_callback_hooks(trainer, hook_name, monitoring_callbacks, *args, **kwargs)
192 with trainer.profiler.profile(f"[Callback]{callback.state_key}.{hook_name}"):
--> 193 fn(trainer, trainer.lightning_module, *args, **kwargs)
195 if pl_module:
196 # restore current_fx when nested context
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/callbacks/lr_monitor.py:158, in LearningRateMonitor.on_train_batch_start(self, trainer, *args, **kwargs)
157 for logger in trainer.loggers:
--> 158 logger.log_metrics(latest_stat, step=trainer.fit_loop.epoch_loop._batches_that_stepped)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
31 if rank == 0:
---> 32 return fn(*args, **kwargs)
33 return None
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:272, in MLFlowLogger.log_metrics(self, metrics, step)
270 metrics_list.append(Metric(key=k, value=v, timestamp=timestamp_ms, step=step or 0))
--> 272 self.experiment.log_batch(run_id=self.run_id, metrics=metrics_list)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:965, in MlflowClient.log_batch(self, run_id, metrics, params, tags)
915 """
916 Log multiple metrics, params, and/or tags.
917
(...)
963 status: FINISHED
964 """
--> 965 self._tracking_client.log_batch(run_id, metrics, params, tags)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:394, in TrackingServiceClient.log_batch(self, run_id, metrics, params, tags)
393 for metrics_batch in chunk_list(metrics, chunk_size=MAX_METRICS_PER_BATCH):
--> 394 self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:321, in RestStore.log_batch(self, run_id, metrics, params, tags)
318 req_body = message_to_json(
319 LogBatch(metrics=metric_protos, params=param_protos, tags=tag_protos, run_id=run_id)
320 )
--> 321 self._call_endpoint(LogBatch, req_body)
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
278 response = http_request(
279 host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
280 )
--> 281 response = verify_rest_response(response, endpoint)
282 js_dict = json.loads(response.text)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
206 if _can_parse_as_json_object(response.text):
--> 207 raise RestException(json.loads(response.text))
208 else:
File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
98 message = "{}: {}".format(
99 error_code,
100 json["message"] if "message" in json else "Response: " + str(json),
101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
103 self.json = json
File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
72 pass # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
74 self._enum_type.name, name))
ValueError: Enum ErrorCode has no value defined for name '403'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
File <command-1886660812327855>:15
1 torch.set_float32_matmul_precision(hparams.float32_matmul_precision)
2 with (
3 train_converter.make_torch_dataloader(
4 batch_size=hparams.batch_size, num_epochs=1
(...)
13 # batch["tokens"].to('cpu')
14 # pl_module._forward_with_loss(batch,"debug")
---> 15 trainer.fit(pl_module, train_dl, val_dl)
16 clear_pl_module()
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/autologging_utils/safety.py:435, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
420 if (
421 active_session_failed
422 or autologging_is_disabled(autologging_integration)
(...)
429 # warning behavior during original function execution, since autologging is being
430 # skipped
431 with set_non_mlflow_warnings_behavior_for_current_thread(
432 disable_warnings=False,
433 reroute_warnings=False,
434 ):
--> 435 return original(*args, **kwargs)
437 # Whether or not the original / underlying function has been called during the
438 # execution of patched code
439 original_has_been_called = False
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:529, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
527 model = _maybe_unwrap_optimized(model)
528 self.strategy._lightning_module = model
--> 529 call._call_and_handle_interrupt(
530 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
531 )
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:65, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
63 trainer.strategy.on_exception(exception)
64 for logger in trainer.loggers:
---> 65 logger.finalize("failed")
66 trainer._teardown()
67 # teardown might access the stage so we reset it after
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
30 raise RuntimeError("The `rank_zero_only.rank` needs to be set before use")
31 if rank == 0:
---> 32 return fn(*args, **kwargs)
33 return None
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:289, in MLFlowLogger.finalize(self, status)
286 if self._checkpoint_callback:
287 self._scan_and_log_checkpoints(self._checkpoint_callback)
--> 289 if self.experiment.get_run(self.run_id):
290 self.experiment.set_terminated(self.run_id, status)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:150, in MlflowClient.get_run(self, run_id)
112 def get_run(self, run_id: str) -> Run:
113 """
114 Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
115 contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
(...)
148 status: FINISHED
149 """
--> 150 return self._tracking_client.get_run(run_id)
File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:72, in TrackingServiceClient.get_run(self, run_id)
58 """
59 Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
60 contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
(...)
69 raises an exception.
70 """
71 _validate_run_id(run_id)
---> 72 return self.store.get_run(run_id)
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:134, in RestStore.get_run(self, run_id)
126 """
127 Fetch the run from backend store
128
(...)
131 :return: A single Run object if it exists, otherwise raises an Exception
132 """
133 req_body = message_to_json(GetRun(run_uuid=run_id, run_id=run_id))
--> 134 response_proto = self._call_endpoint(GetRun, req_body)
135 return Run.from_proto(response_proto.run)
File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
54 endpoint, method = _METHOD_TO_INFO[api]
55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
277 else:
278 response = http_request(
279 host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
280 )
--> 281 response = verify_rest_response(response, endpoint)
282 js_dict = json.loads(response.text)
283 parse_dict(js_dict=js_dict, message=response_proto)
File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
205 if response.status_code != 200:
206 if _can_parse_as_json_object(response.text):
--> 207 raise RestException(json.loads(response.text))
208 else:
209 base_msg = "API request to endpoint {} failed with error code {} != 200".format(
210 endpoint,
211 response.status_code,
212 )
File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
97 error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
98 message = "{}: {}".format(
99 error_code,
100 json["message"] if "message" in json else "Response: " + str(json),
101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
103 self.json = json
File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
71 except KeyError:
72 pass # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
74 self._enum_type.name, name))
ValueError: Enum ErrorCode has no value defined for name '403'
Config details:
Databricks Runtime 12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)
05-28-2024 03:50 PM
Hello @Alex42!
The error indicates that access was forbidden because the access token expired. The Databricks access token that the MLflow Python client uses to communicate with the tracking server has a limited lifespan, typically 48 hours, set for security reasons. When a notebook or job runs longer than that, the token expires and MLflow calls fail with a 403 Invalid access token error. In your traceback, that 403 is hidden behind the ValueError: the response carries the error code '403', which has no matching name in MLflow's ErrorCode enum, so constructing the RestException itself fails.
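Because the underlying 403 shows up as a ValueError rather than a clear authentication error, it can help to recognize it explicitly in long-running jobs. The following is a minimal sketch, not part of MLflow: the helper name `looks_like_expired_token` is hypothetical, and it matches only on the message text shown in the traceback above, which may differ in other MLflow versions.

```python
def looks_like_expired_token(exc: BaseException) -> bool:
    """Heuristic check for the masked 403 seen in this thread's traceback."""
    # Hypothetical helper: it inspects only the exception type and message text.
    text = str(exc)
    return isinstance(exc, ValueError) and "ErrorCode" in text and "'403'" in text


# The exact message raised in the traceback above.
masked = ValueError("Enum ErrorCode has no value defined for name '403'")
# An unrelated ValueError that must not be flagged.
unrelated = ValueError("invalid literal for int() with base 10: 'abc'")

print(looks_like_expired_token(masked))     # True
print(looks_like_expired_token(unrelated))  # False
```

Calling a check like this from an `except ValueError` block around `trainer.fit(...)` would let you report a likely token expiry instead of the confusing enum error.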
To resolve the issue, generate a personal access token (PAT), then authenticate with it explicitly in your notebook:
import os

from databricks.sdk import WorkspaceClient

# Expose the PAT and the workspace URL through environment variables.
os.environ["DATABRICKS_TOKEN"] = "PAT-you-generated-in-step-one"
os.environ["DATABRICKS_HOST"] = "https://<YOUR_DATABRICKS_WORKSPACE_URL>"

# Create a workspace client that authenticates with the PAT.
w = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)
Alternatively, you can store your PAT in a secret and perform this authentication in a more elegant and secure way, as described in steps 2 and 3 of this Knowledge Base article: https://kb.databricks.com/en_US/machine-learning/mlflow-invalid-access-token-error.
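As a rough illustration of the secret-based variant, the sketch below reads the PAT through a lookup function instead of hard-coding it. Everything named here is an assumption for illustration: the function `configure_databricks_auth`, the scope `ml-scope`, and the key `mlflow-pat` are hypothetical, and the Knowledge Base article's actual steps may differ.

```python
import os
from typing import Callable, MutableMapping


def configure_databricks_auth(
    host: str,
    get_secret: Callable[[str, str], str],
    scope: str,
    key: str,
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Read a PAT from a secret store and export it with the workspace URL."""
    token = get_secret(scope, key)
    if not token:
        raise ValueError(f"Secret {scope}/{key} is empty")
    # Strip a trailing slash so the host is stored in a consistent form.
    env["DATABRICKS_HOST"] = host.rstrip("/")
    env["DATABRICKS_TOKEN"] = token


# Stand-in secret store for illustration only. In a Databricks notebook you
# would pass something like:
#   lambda scope, key: dbutils.secrets.get(scope=scope, key=key)
fake_secrets = {("ml-scope", "mlflow-pat"): "PAT-you-generated-in-step-one"}
demo_env = {}
configure_databricks_auth(
    host="https://<YOUR_DATABRICKS_WORKSPACE_URL>/",
    get_secret=lambda scope, key: fake_secrets[(scope, key)],
    scope="ml-scope",
    key="mlflow-pat",
    env=demo_env,
)
print(sorted(demo_env))  # ['DATABRICKS_HOST', 'DATABRICKS_TOKEN']
```

Passing the lookup as a parameter keeps the token out of the notebook source and makes the helper easy to exercise outside Databricks.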
Best Regards,
Jéssica Santos