Databricks Community

Alex42 · ‎09-05-2023

Hi there,

After exactly 2 days of training, an API call to MLflow raises the following error:

ValueError: Enum ErrorCode has no value defined for name '403'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     41         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 42     return trainer_fn(*args, **kwargs)
     44 except _TunerExitException:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:568, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    562 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    563     self.state.fn,
    564     ckpt_path,
    565     model_provided=True,
    566     model_connected=self.lightning_module is not None,
    567 )
--> 568 self._run(model, ckpt_path=ckpt_path)
    570 assert self.state.stopped

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:973, in Trainer._run(self, model, ckpt_path)
    970 # ----------------------------
    971 # RUN THE TRAINER
    972 # ----------------------------
--> 973 results = self._run_stage()
    975 # ----------------------------
    976 # POST-Training CLEAN UP
    977 # ----------------------------

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1016, in Trainer._run_stage(self)
   1015 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1016     self.fit_loop.run()
   1017 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:201, in _FitLoop.run(self)
    200 self.on_advance_start()
--> 201 self.advance()
    202 self.on_advance_end()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:354, in _FitLoop.advance(self)
    353 with self.trainer.profiler.profile("run_training_epoch"):
--> 354     self.epoch_loop.run(self._data_fetcher)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:133, in _TrainingEpochLoop.run(self, data_fetcher)
    132 try:
--> 133     self.advance(data_fetcher)
    134     self.on_advance_end()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:206, in _TrainingEpochLoop.advance(self, data_fetcher)
    204 else:
    205     # hook
--> 206     call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
    207     response = call._call_lightning_module_hook(trainer, "on_train_batch_start", batch, batch_idx)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:193, in _call_callback_hooks(trainer, hook_name, monitoring_callbacks, *args, **kwargs)
    192         with trainer.profiler.profile(f"[Callback]{callback.state_key}.{hook_name}"):
--> 193             fn(trainer, trainer.lightning_module, *args, **kwargs)
    195 if pl_module:
    196     # restore current_fx when nested context

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/callbacks/lr_monitor.py:158, in LearningRateMonitor.on_train_batch_start(self, trainer, *args, **kwargs)
    157 for logger in trainer.loggers:
--> 158     logger.log_metrics(latest_stat, step=trainer.fit_loop.epoch_loop._batches_that_stepped)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
     31 if rank == 0:
---> 32     return fn(*args, **kwargs)
     33 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:272, in MLFlowLogger.log_metrics(self, metrics, step)
    270     metrics_list.append(Metric(key=k, value=v, timestamp=timestamp_ms, step=step or 0))
--> 272 self.experiment.log_batch(run_id=self.run_id, metrics=metrics_list)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:965, in MlflowClient.log_batch(self, run_id, metrics, params, tags)
    915 """
    916 Log multiple metrics, params, and/or tags.
    917 
   (...)
    963     status: FINISHED
    964 """
--> 965 self._tracking_client.log_batch(run_id, metrics, params, tags)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:394, in TrackingServiceClient.log_batch(self, run_id, metrics, params, tags)
    393 for metrics_batch in chunk_list(metrics, chunk_size=MAX_METRICS_PER_BATCH):
--> 394     self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:321, in RestStore.log_batch(self, run_id, metrics, params, tags)
    318 req_body = message_to_json(
    319     LogBatch(metrics=metric_protos, params=param_protos, tags=tag_protos, run_id=run_id)
    320 )
--> 321 self._call_endpoint(LogBatch, req_body)

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
     55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
    278     response = http_request(
    279         host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
    280     )
--> 281 response = verify_rest_response(response, endpoint)
    282 js_dict = json.loads(response.text)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
    206 if _can_parse_as_json_object(response.text):
--> 207     raise RestException(json.loads(response.text))
    208 else:

File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
     98 message = "{}: {}".format(
     99     error_code,
    100     json["message"] if "message" in json else "Response: " + str(json),
    101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
    103 self.json = json

File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
     72   pass  # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
     74     self._enum_type.name, name))

ValueError: Enum ErrorCode has no value defined for name '403'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File <command-1886660812327855>:15
      1 torch.set_float32_matmul_precision(hparams.float32_matmul_precision)
      2 with (
      3     train_converter.make_torch_dataloader(
      4         batch_size=hparams.batch_size, num_epochs=1
   (...)
     13     # batch["tokens"].to('cpu')
     14     # pl_module._forward_with_loss(batch,"debug")
---> 15     trainer.fit(pl_module, train_dl, val_dl)
     16 clear_pl_module()

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/autologging_utils/safety.py:435, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
    420 if (
    421     active_session_failed
    422     or autologging_is_disabled(autologging_integration)
   (...)
    429     # warning behavior during original function execution, since autologging is being
    430     # skipped
    431     with set_non_mlflow_warnings_behavior_for_current_thread(
    432         disable_warnings=False,
    433         reroute_warnings=False,
    434     ):
--> 435         return original(*args, **kwargs)
    437 # Whether or not the original / underlying function has been called during the
    438 # execution of patched code
    439 original_has_been_called = False

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:529, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    527 model = _maybe_unwrap_optimized(model)
    528 self.strategy._lightning_module = model
--> 529 call._call_and_handle_interrupt(
    530     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    531 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:65, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     63 trainer.strategy.on_exception(exception)
     64 for logger in trainer.loggers:
---> 65     logger.finalize("failed")
     66 trainer._teardown()
     67 # teardown might access the stage so we reset it after

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
     30     raise RuntimeError("The `rank_zero_only.rank` needs to be set before use")
     31 if rank == 0:
---> 32     return fn(*args, **kwargs)
     33 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:289, in MLFlowLogger.finalize(self, status)
    286 if self._checkpoint_callback:
    287     self._scan_and_log_checkpoints(self._checkpoint_callback)
--> 289 if self.experiment.get_run(self.run_id):
    290     self.experiment.set_terminated(self.run_id, status)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:150, in MlflowClient.get_run(self, run_id)
    112 def get_run(self, run_id: str) -> Run:
    113     """
    114     Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
    115     contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
   (...)
    148         status: FINISHED
    149     """
--> 150     return self._tracking_client.get_run(run_id)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:72, in TrackingServiceClient.get_run(self, run_id)
     58 """
     59 Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
     60 contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
   (...)
     69          raises an exception.
     70 """
     71 _validate_run_id(run_id)
---> 72 return self.store.get_run(run_id)

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:134, in RestStore.get_run(self, run_id)
    126 """
    127 Fetch the run from backend store
    128 
   (...)
    131 :return: A single Run object if it exists, otherwise raises an Exception
    132 """
    133 req_body = message_to_json(GetRun(run_uuid=run_id, run_id=run_id))
--> 134 response_proto = self._call_endpoint(GetRun, req_body)
    135 return Run.from_proto(response_proto.run)

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
     54 endpoint, method = _METHOD_TO_INFO[api]
     55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
    277 else:
    278     response = http_request(
    279         host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
    280     )
--> 281 response = verify_rest_response(response, endpoint)
    282 js_dict = json.loads(response.text)
    283 parse_dict(js_dict=js_dict, message=response_proto)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
    205 if response.status_code != 200:
    206     if _can_parse_as_json_object(response.text):
--> 207         raise RestException(json.loads(response.text))
    208     else:
    209         base_msg = "API request to endpoint {} failed with error code {} != 200".format(
    210             endpoint,
    211             response.status_code,
    212         )

File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
     97 error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
     98 message = "{}: {}".format(
     99     error_code,
    100     json["message"] if "message" in json else "Response: " + str(json),
    101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
    103 self.json = json

File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
     71 except KeyError:
     72   pass  # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
     74     self._enum_type.name, name))

ValueError: Enum ErrorCode has no value defined for name '403'

Config details:

12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)

pytorch-lightning==2.0.5

sagemaker==2.165.0

tokenizers==0.13.3

transformers==4.31.0

Does anyone have tips/insights on how to avoid timing out? I'm guessing a temporary token for Databricks MLflow access expired somewhere.

jessysantos · ‎05-28-2024

Hello @Alex42 !

The error indicates that access was forbidden because an access token expired. The Databricks access token that the MLflow Python client uses to communicate with the tracking server has a limited lifespan, typically 48 hours, set for security reasons. When a notebook or job runs longer than that, the token expires and MLflow calls fail with a 403 Invalid access token error. In your traceback this surfaces as a ValueError because the error code in the response, '403', is not a name defined in MLflow's ErrorCode enum.
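
For anyone who wants to detect this failure in code: in the traceback above, RestException.__init__ passes the response's error_code ('403') to ErrorCode.Value, and that call raises the ValueError. Below is a minimal sketch of a check for that message. The helper name looks_like_mlflow_403 is hypothetical and not part of MLflow, and matching on the message text is an assumption based only on the traceback in this thread.

```python
def looks_like_mlflow_403(exc: BaseException) -> bool:
    """Return True if exc matches the ValueError shown in this thread.

    Hypothetical helper: it only inspects the message text, which is an
    assumption based on the traceback above, not a documented MLflow contract.
    """
    message = str(exc)
    return (
        isinstance(exc, ValueError)
        and "ErrorCode" in message
        and "'403'" in message
    )


# Example using the exact message from the traceback.
error = ValueError("Enum ErrorCode has no value defined for name '403'")
print(looks_like_mlflow_403(error))
```

A check like this could be used in an except block around trainer.fit to tell this failure apart from an unrelated ValueError. It does not fix the expired token by itself.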

To resolve the issue, please follow these steps:

1. Create a Personal Access Token (PAT) by following the instructions in this documentation: https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspa.... Keep the token's default lifetime of 90 days.
2. Using the Databricks SDK, authenticate at the beginning of your code with the following commands:

import os

from databricks.sdk import WorkspaceClient

# Set the environment variables that hold the PAT and the workspace URL.
os.environ["DATABRICKS_TOKEN"] = "PAT-you-generated-in-step-one"
os.environ["DATABRICKS_HOST"] = "https://<YOUR_DATABRICKS_WORKSPACE_URL>"

# Authenticate the Databricks SDK client with the same host and token.
w = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)

Alternatively, you can store your PAT in a secret and perform this authentication in a more elegant and secure way, as described in steps 2 and 3 of this Knowledge Base article: https://kb.databricks.com/en_US/machine-learning/mlflow-invalid-access-token-error.
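
As a rough illustration of the secret-based variant (not a copy of the Knowledge Base steps), the sketch below reads the PAT through a secret getter and exports the same two environment variables. The function name configure_databricks_auth and the scope and key names in the usage comment are hypothetical. The getter is passed in as a parameter so that the function can also run outside a notebook, where dbutils is not defined.

```python
import os
from typing import Callable


def configure_databricks_auth(
    get_secret: Callable[[str, str], str],
    host: str,
    scope: str,
    key: str,
) -> None:
    """Read the PAT via get_secret(scope, key) and export it for later clients.

    Hypothetical helper; get_secret is expected to behave like a thin wrapper
    around dbutils.secrets.get in a Databricks notebook.
    """
    token = get_secret(scope, key)
    if not token:
        raise ValueError(f"Secret {scope}/{key} is empty")
    # Drop a trailing slash so the host has a single, consistent form.
    os.environ["DATABRICKS_HOST"] = host.rstrip("/")
    os.environ["DATABRICKS_TOKEN"] = token


# In a Databricks notebook (scope and key names are placeholders):
# configure_databricks_auth(
#     lambda scope, key: dbutils.secrets.get(scope=scope, key=key),
#     host="https://<YOUR_DATABRICKS_WORKSPACE_URL>",
#     scope="my-scope",
#     key="mlflow-pat",
# )
```

Calling this once before the training code keeps the token out of the notebook source.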

Best Regards,

Jéssica Santos