
Bug: MLflow connection fails after 2 days

Alex42
New Contributor

Hi there, 

After exactly 2 days of training, the following error is raised during an API call to MLflow:

ValueError: Enum ErrorCode has no value defined for name '403'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:42, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     41         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 42     return trainer_fn(*args, **kwargs)
     44 except _TunerExitException:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:568, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    562 ckpt_path = self._checkpoint_connector._select_ckpt_path(
    563     self.state.fn,
    564     ckpt_path,
    565     model_provided=True,
    566     model_connected=self.lightning_module is not None,
    567 )
--> 568 self._run(model, ckpt_path=ckpt_path)
    570 assert self.state.stopped

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:973, in Trainer._run(self, model, ckpt_path)
    970 # ----------------------------
    971 # RUN THE TRAINER
    972 # ----------------------------
--> 973 results = self._run_stage()
    975 # ----------------------------
    976 # POST-Training CLEAN UP
    977 # ----------------------------

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1016, in Trainer._run_stage(self)
   1015 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
-> 1016     self.fit_loop.run()
   1017 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:201, in _FitLoop.run(self)
    200 self.on_advance_start()
--> 201 self.advance()
    202 self.on_advance_end()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py:354, in _FitLoop.advance(self)
    353 with self.trainer.profiler.profile("run_training_epoch"):
--> 354     self.epoch_loop.run(self._data_fetcher)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:133, in _TrainingEpochLoop.run(self, data_fetcher)
    132 try:
--> 133     self.advance(data_fetcher)
    134     self.on_advance_end()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loops/training_epoch_loop.py:206, in _TrainingEpochLoop.advance(self, data_fetcher)
    204 else:
    205     # hook
--> 206     call._call_callback_hooks(trainer, "on_train_batch_start", batch, batch_idx)
    207     response = call._call_lightning_module_hook(trainer, "on_train_batch_start", batch, batch_idx)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:193, in _call_callback_hooks(trainer, hook_name, monitoring_callbacks, *args, **kwargs)
    192         with trainer.profiler.profile(f"[Callback]{callback.state_key}.{hook_name}"):
--> 193             fn(trainer, trainer.lightning_module, *args, **kwargs)
    195 if pl_module:
    196     # restore current_fx when nested context

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/callbacks/lr_monitor.py:158, in LearningRateMonitor.on_train_batch_start(self, trainer, *args, **kwargs)
    157 for logger in trainer.loggers:
--> 158     logger.log_metrics(latest_stat, step=trainer.fit_loop.epoch_loop._batches_that_stepped)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
     31 if rank == 0:
---> 32     return fn(*args, **kwargs)
     33 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:272, in MLFlowLogger.log_metrics(self, metrics, step)
    270     metrics_list.append(Metric(key=k, value=v, timestamp=timestamp_ms, step=step or 0))
--> 272 self.experiment.log_batch(run_id=self.run_id, metrics=metrics_list)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:965, in MlflowClient.log_batch(self, run_id, metrics, params, tags)
    915 """
    916 Log multiple metrics, params, and/or tags.
    917 
   (...)
    963     status: FINISHED
    964 """
--> 965 self._tracking_client.log_batch(run_id, metrics, params, tags)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:394, in TrackingServiceClient.log_batch(self, run_id, metrics, params, tags)
    393 for metrics_batch in chunk_list(metrics, chunk_size=MAX_METRICS_PER_BATCH):
--> 394     self.store.log_batch(run_id=run_id, metrics=metrics_batch, params=[], tags=[])

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:321, in RestStore.log_batch(self, run_id, metrics, params, tags)
    318 req_body = message_to_json(
    319     LogBatch(metrics=metric_protos, params=param_protos, tags=tag_protos, run_id=run_id)
    320 )
--> 321 self._call_endpoint(LogBatch, req_body)

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
     55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
    278     response = http_request(
    279         host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
    280     )
--> 281 response = verify_rest_response(response, endpoint)
    282 js_dict = json.loads(response.text)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
    206 if _can_parse_as_json_object(response.text):
--> 207     raise RestException(json.loads(response.text))
    208 else:

File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
     98 message = "{}: {}".format(
     99     error_code,
    100     json["message"] if "message" in json else "Response: " + str(json),
    101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
    103 self.json = json

File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
     72   pass  # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
     74     self._enum_type.name, name))

ValueError: Enum ErrorCode has no value defined for name '403'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File <command-1886660812327855>:15
      1 torch.set_float32_matmul_precision(hparams.float32_matmul_precision)
      2 with (
      3     train_converter.make_torch_dataloader(
      4         batch_size=hparams.batch_size, num_epochs=1
   (...)
     13     # batch["tokens"].to('cpu')
     14     # pl_module._forward_with_loss(batch,"debug")
---> 15     trainer.fit(pl_module, train_dl, val_dl)
     16 clear_pl_module()

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/autologging_utils/safety.py:435, in safe_patch.<locals>.safe_patch_function(*args, **kwargs)
    420 if (
    421     active_session_failed
    422     or autologging_is_disabled(autologging_integration)
   (...)
    429     # warning behavior during original function execution, since autologging is being
    430     # skipped
    431     with set_non_mlflow_warnings_behavior_for_current_thread(
    432         disable_warnings=False,
    433         reroute_warnings=False,
    434     ):
--> 435         return original(*args, **kwargs)
    437 # Whether or not the original / underlying function has been called during the
    438 # execution of patched code
    439 original_has_been_called = False

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:529, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    527 model = _maybe_unwrap_optimized(model)
    528 self.strategy._lightning_module = model
--> 529 call._call_and_handle_interrupt(
    530     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    531 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:65, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     63 trainer.strategy.on_exception(exception)
     64 for logger in trainer.loggers:
---> 65     logger.finalize("failed")
     66 trainer._teardown()
     67 # teardown might access the stage so we reset it after

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py:32, in rank_zero_only.<locals>.wrapped_fn(*args, **kwargs)
     30     raise RuntimeError("The `rank_zero_only.rank` needs to be set before use")
     31 if rank == 0:
---> 32     return fn(*args, **kwargs)
     33 return None

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pytorch_lightning/loggers/mlflow.py:289, in MLFlowLogger.finalize(self, status)
    286 if self._checkpoint_callback:
    287     self._scan_and_log_checkpoints(self._checkpoint_callback)
--> 289 if self.experiment.get_run(self.run_id):
    290     self.experiment.set_terminated(self.run_id, status)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/client.py:150, in MlflowClient.get_run(self, run_id)
    112 def get_run(self, run_id: str) -> Run:
    113     """
    114     Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
    115     contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
   (...)
    148         status: FINISHED
    149     """
--> 150     return self._tracking_client.get_run(run_id)

File /databricks/python/lib/python3.9/site-packages/mlflow/tracking/_tracking_service/client.py:72, in TrackingServiceClient.get_run(self, run_id)
     58 """
     59 Fetch the run from backend store. The resulting :py:class:`Run <mlflow.entities.Run>`
     60 contains a collection of run metadata -- :py:class:`RunInfo <mlflow.entities.RunInfo>`,
   (...)
     69          raises an exception.
     70 """
     71 _validate_run_id(run_id)
---> 72 return self.store.get_run(run_id)

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:134, in RestStore.get_run(self, run_id)
    126 """
    127 Fetch the run from backend store
    128 
   (...)
    131 :return: A single Run object if it exists, otherwise raises an Exception
    132 """
    133 req_body = message_to_json(GetRun(run_uuid=run_id, run_id=run_id))
--> 134 response_proto = self._call_endpoint(GetRun, req_body)
    135 return Run.from_proto(response_proto.run)

File /databricks/python/lib/python3.9/site-packages/mlflow/store/tracking/rest_store.py:56, in RestStore._call_endpoint(self, api, json_body)
     54 endpoint, method = _METHOD_TO_INFO[api]
     55 response_proto = api.Response()
---> 56 return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:281, in call_endpoint(host_creds, endpoint, method, json_body, response_proto)
    277 else:
    278     response = http_request(
    279         host_creds=host_creds, endpoint=endpoint, method=method, json=json_body
    280     )
--> 281 response = verify_rest_response(response, endpoint)
    282 js_dict = json.loads(response.text)
    283 parse_dict(js_dict=js_dict, message=response_proto)

File /databricks/python/lib/python3.9/site-packages/mlflow/utils/rest_utils.py:207, in verify_rest_response(response, endpoint)
    205 if response.status_code != 200:
    206     if _can_parse_as_json_object(response.text):
--> 207         raise RestException(json.loads(response.text))
    208     else:
    209         base_msg = "API request to endpoint {} failed with error code {} != 200".format(
    210             endpoint,
    211             response.status_code,
    212         )

File /databricks/python/lib/python3.9/site-packages/mlflow/exceptions.py:102, in RestException.__init__(self, json)
     97 error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
     98 message = "{}: {}".format(
     99     error_code,
    100     json["message"] if "message" in json else "Response: " + str(json),
    101 )
--> 102 super().__init__(message, error_code=ErrorCode.Value(error_code))
    103 self.json = json

File /databricks/python/lib/python3.9/site-packages/google/protobuf/internal/enum_type_wrapper.py:73, in EnumTypeWrapper.Value(self, name)
     71 except KeyError:
     72   pass  # fall out to break exception chaining
---> 73 raise ValueError('Enum {} has no value defined for name {!r}'.format(
     74     self._enum_type.name, name))

ValueError: Enum ErrorCode has no value defined for name '403'

Config details:

12.2 LTS ML (includes Apache Spark 3.3.2, GPU, Scala 2.12)

 
Does anyone have tips or insights on how to avoid this timeout? My guess is that a temporary token for Databricks MLflow access expired somewhere.
1 REPLY

jessysantos
Databricks Employee

Hello @Alex42!

The error message indicates that access is forbidden because an access token has expired. The Databricks access token that the MLflow Python client uses to communicate with the tracking server has a limited lifespan, set to 48 hours by default for security reasons. When a notebook or job runs longer than that, the token expires and subsequent MLflow calls fail with a 403 Invalid access token error.

To resolve the issue, please follow these steps:

  1. Create a Personal Access Token (PAT) by following the instructions in this documentation: https://docs.databricks.com/en/dev-tools/auth/pat.html#databricks-personal-access-tokens-for-workspa.... You can keep the token's default lifetime of 90 days.
  2. Then, using the Databricks SDK, authenticate at the beginning of your code with the following commands:
import os

from databricks.sdk import WorkspaceClient

# Export the PAT and workspace URL so Databricks clients use them
# instead of the notebook's short-lived token.
os.environ["DATABRICKS_TOKEN"] = "PAT-you-generated-in-step-one"
os.environ["DATABRICKS_HOST"] = "https://<YOUR_DATABRICKS_WORKSPACE_URL>"

w = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)
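Once these environment variables are set, the MLflow tracking client should resolve its Databricks credentials from DATABRICKS_HOST and DATABRICKS_TOKEN, so logging calls made later in training (including those issued by PyTorch Lightning's MLFlowLogger) authenticate with the long-lived PAT rather than the ephemeral notebook token.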

Alternatively, you can store your PAT in a secret and perform this authentication in a more elegant and secure way, as described in steps 2 and 3 of this Knowledge Base article: https://kb.databricks.com/en_US/machine-learning/mlflow-invalid-access-token-error.
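For illustration, here is a minimal sketch of that secret-based variant. It assumes you have already stored the PAT in a secret scope; the scope and key names below (ml-tokens, mlflow-pat) are hypothetical, so substitute your own:

import os

from databricks.sdk import WorkspaceClient

# Hypothetical scope/key names; dbutils.secrets.get is available inside Databricks notebooks.
pat = dbutils.secrets.get(scope="ml-tokens", key="mlflow-pat")

os.environ["DATABRICKS_TOKEN"] = pat
os.environ["DATABRICKS_HOST"] = "https://<YOUR_DATABRICKS_WORKSPACE_URL>"

w = WorkspaceClient(
    host=os.environ["DATABRICKS_HOST"],
    token=os.environ["DATABRICKS_TOKEN"],
)

This keeps the PAT out of the notebook source while achieving the same authentication as the snippet above.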

Best Regards,

Jéssica Santos


