<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to fix this runtime error in this Databricks distributed training tutorial workbook in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/41057#M2073</link>
    <description>Databricks Community thread: a TorchDistributor run from the distributed fine-tuning tutorial notebook fails with "RuntimeError: TorchDistributor failed during training"; the stdout trace shows MLflow raising databricks_cli InvalidConfigurationError.</description>
    <pubDate>Tue, 22 Aug 2023 20:38:44 GMT</pubDate>
    <dc:creator>AChang</dc:creator>
    <dc:date>2023-08-22T20:38:44Z</dc:date>
    <item>
      <title>How to fix this runtime error in this Databricks distributed training tutorial workbook</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/41057#M2073</link>
      <description>&lt;P&gt;I am following along with this &lt;A href="https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/distributed-fine-tuning-hugging-face.html" target="_blank" rel="noopener"&gt;notebook&lt;/A&gt;&amp;nbsp;linked from this &lt;A href="https://docs.databricks.com/en/machine-learning/train-model/distributed-training/spark-pytorch-distributor.html" target="_blank" rel="noopener"&gt;article&lt;/A&gt;. I am attempting to fine-tune the model with a single node and multiple GPUs, so I run everything up to the "Run Local Training" section, but from there I skip to "Run distributed training on a single node with multiple GPUs". When I run that first block, though, I get this error:&lt;/P&gt;&lt;P&gt;`RuntimeError: TorchDistributor failed during training. View stdout logs for detailed error message.`&lt;/P&gt;&lt;P&gt;Here is the full output I see from the code block:&lt;BR /&gt;```&lt;BR /&gt;We're using 4 GPUs&lt;BR /&gt;Started local training with 4 processes&lt;BR /&gt;WARNING:__main__:&lt;BR /&gt;*****************************************&lt;BR /&gt;Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.&lt;BR /&gt;*****************************************&lt;BR /&gt;2023-08-22 19:31:47.794586: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA&lt;BR /&gt;To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.&lt;BR /&gt;2023-08-22 19:31:47.809864: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA&lt;BR 
/&gt;To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.&lt;BR /&gt;2023-08-22 19:31:47.824423: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA&lt;BR /&gt;To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.&lt;BR /&gt;2023-08-22 19:31:47.828933: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA&lt;BR /&gt;To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.&lt;BR /&gt;/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning&lt;BR /&gt;warnings.warn(&lt;BR /&gt;/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning&lt;BR /&gt;warnings.warn(&lt;BR /&gt;/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. 
Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning&lt;BR /&gt;warnings.warn(&lt;BR /&gt;/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning&lt;BR /&gt;warnings.warn(&lt;BR /&gt;You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.&lt;BR /&gt;You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.&lt;BR /&gt;You're using a DistilBertTokenizerFast tokenizer. 
Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.&lt;BR /&gt;Traceback (most recent call last):&lt;BR /&gt;File "/tmp/tmpz1ss252g/train.py", line 8, in &amp;lt;module&amp;gt;&lt;BR /&gt;output = train_fn(*args)&lt;BR /&gt;File "&amp;lt;command-2821949673242075&amp;gt;", line 46, in train_model&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train&lt;BR /&gt;return inner_training_loop(&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/transformers/trainer.py", line 1855, in _inner_training_loop&lt;BR /&gt;self.control = self.callback_handler.on_train_begin(args, self.state, self.control)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin&lt;BR /&gt;return self.call_event("on_train_begin", args, state, control)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/transformers/trainer_callback.py", line 397, in call_event&lt;BR /&gt;result = getattr(callback, event)(&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 1021, in on_train_begin&lt;BR /&gt;self.setup(args, state, model)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 990, in setup&lt;BR /&gt;self._ml_flow.start_run(run_name=args.run_name, nested=self._nested_run)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 363, in start_run&lt;BR /&gt;active_run_obj = client.create_run(&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/client.py", line 326, in create_run&lt;BR /&gt;return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)&lt;BR /&gt;File 
"/databricks/python/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 133, in create_run&lt;BR /&gt;return self.store.create_run(&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 178, in create_run&lt;BR /&gt;response_proto = self._call_endpoint(CreateRun, req_body)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint&lt;BR /&gt;return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/mlflow/utils/databricks_utils.py", line 422, in get_databricks_host_creds&lt;BR /&gt;config = provider.get_config()&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/databricks_cli/configure/provider.py", line 134, in get_config&lt;BR /&gt;raise InvalidConfigurationError.for_profile(None)&lt;BR /&gt;databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! 
Please configure by entering `/tmp/tmpz1ss252g/train.py configure`&lt;BR /&gt;WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2572 closing signal SIGTERM&lt;BR /&gt;WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2573 closing signal SIGTERM&lt;BR /&gt;WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2574 closing signal SIGTERM&lt;BR /&gt;ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2571) of binary: /local_disk0/.ephemeral_nfs/envs/pythonEnv-3b3dff80-496a-4c7d-9684-b04a17a299d3/bin/python&lt;BR /&gt;Traceback (most recent call last):&lt;BR /&gt;File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main&lt;BR /&gt;return _run_code(code, main_globals, None,&lt;BR /&gt;File "/usr/lib/python3.10/runpy.py", line 86, in _run_code&lt;BR /&gt;exec(code, run_globals)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in &amp;lt;module&amp;gt;&lt;BR /&gt;main()&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper&lt;BR /&gt;return f(*args, **kwargs)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main&lt;BR /&gt;run(args)&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run&lt;BR /&gt;elastic_launch(&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__&lt;BR /&gt;return launch_agent(self._config, self._entrypoint, list(args))&lt;BR /&gt;File "/databricks/python/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent&lt;BR /&gt;raise ChildFailedError(&lt;BR /&gt;torch.distributed.elastic.multiprocessing.errors.ChildFailedError:&lt;BR 
/&gt;============================================================&lt;BR /&gt;/tmp/tmpz1ss252g/train.py FAILED&lt;BR /&gt;------------------------------------------------------------&lt;BR /&gt;Failures:&lt;BR /&gt;&amp;lt;NO_OTHER_FAILURES&amp;gt;&lt;BR /&gt;------------------------------------------------------------&lt;BR /&gt;Root Cause (first observed failure):&lt;BR /&gt;[0]:&lt;BR /&gt;time : 2023-08-22_19:31:58&lt;BR /&gt;host : 0821-144503-em46c4jc-10-52-237-200&lt;BR /&gt;rank : 0 (local_rank: 0)&lt;BR /&gt;exitcode : 1 (pid: 2571)&lt;BR /&gt;error_file: &amp;lt;N/A&amp;gt;&lt;BR /&gt;traceback : To enable traceback see: &lt;A href="https://pytorch.org/docs/stable/elastic/errors.html" target="_blank"&gt;https://pytorch.org/docs/stable/elastic/errors.html&lt;/A&gt;&lt;BR /&gt;============================================================&lt;BR /&gt;```&lt;/P&gt;&lt;P&gt;Do I need to enable more traceback to see more of the error? Do I need to 'configure the CLI', whatever that means? Is there something extremely obvious I'm just missing?&lt;/P&gt;&lt;P&gt;I am using a g5.12xlarge with 4 GPUs, and my Databricks runtime version is '13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)'. I'm running this from within a Databricks notebook.&lt;/P&gt;</description>
      <pubDate>Tue, 22 Aug 2023 20:38:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/41057#M2073</guid>
      <dc:creator>AChang</dc:creator>
      <dc:date>2023-08-22T20:38:44Z</dc:date>
    </item>
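The stdout trace in the post above points at the actual failure: the Hugging Face Trainer's MLflow callback calls mlflow.start_run() inside each worker process spawned by TorchDistributor, and those subprocesses do not inherit the notebook's Databricks credentials, so databricks_cli raises InvalidConfigurationError. A commonly suggested workaround, sketched here under assumptions and not confirmed in this thread, is to export DATABRICKS_HOST and DATABRICKS_TOKEN inside the training function before the Trainer starts. The wrapper and the stand-in train_model below are hypothetical; in a notebook the host and token would come from the workspace (for example via dbutils), which is not shown here:

```python
import os

def with_databricks_creds(train_fn, host, token):
    """Wrap a training function so that each process spawned by
    TorchDistributor sets the Databricks credentials MLflow needs.
    MLflow's Databricks credential provider reads these environment
    variables before falling back to the (unconfigured) CLI profile."""
    def wrapped(*args, **kwargs):
        os.environ["DATABRICKS_HOST"] = host
        os.environ["DATABRICKS_TOKEN"] = token
        return train_fn(*args, **kwargs)
    return wrapped

# Stand-in for the tutorial's train_model, just to show the call shape.
def train_model(lr):
    return f"trained with lr={lr}"

safe_train = with_databricks_creds(
    train_model,
    "https://example.cloud.databricks.com",  # placeholder workspace URL
    "dapi-placeholder-token",                 # placeholder PAT
)
print(safe_train(1e-5))
```

Alternatively, if MLflow logging is not needed from the worker processes, the credential lookup can be avoided entirely by passing `report_to=[]` to the transformers `TrainingArguments`, which disables the MLflow callback.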
    <item>
      <title>Re: How to fix this runtime error in this Databricks distributed training tutorial workbook</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/66283#M3195</link>
      <description>&lt;P&gt;Hi AChang, did you eventually resolve the error? I'm also having the same error.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Apr 2024 14:45:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/66283#M3195</guid>
      <dc:creator>KYX</dc:creator>
      <dc:date>2024-04-15T14:45:54Z</dc:date>
    </item>
    <item>
      <title>Re: How to fix this runtime error in this Databricks distributed training tutorial workbook</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/66285#M3196</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/103597"&gt;@KYX&lt;/a&gt;, I don't believe I ever did. You can try configuring the CLI from the ephemeral terminal in the notebook, but it really shouldn't be necessary, so I think something else is going on.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Apr 2024 14:50:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-fix-this-runtime-error-in-this-databricks-distributed/m-p/66285#M3196</guid>
      <dc:creator>AChang</dc:creator>
      <dc:date>2024-04-15T14:50:35Z</dc:date>
    </item>
  </channel>
</rss>

