Databricks Community

AChang · ‎08-22-2023

I am following along with this notebook, which I found through this article. I am trying to fine-tune the model on a single node with multiple GPUs, so I run everything up to the "Run Local Training" section and then skip to "Run distributed training on a single node with multiple GPUs". When I run the first block of that section, though, I get this error:

`RuntimeError: TorchDistributor failed during training. View stdout logs for detailed error message.`

Here is the full output I see from the code block:
```
We're using 4 GPUs
Started local training with 4 processes
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2023-08-22 19:31:47.794586: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 19:31:47.809864: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 19:31:47.824423: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 19:31:47.828933: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
/databricks/python/lib/python3.10/site-packages/transformers/optimization.py:407: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
File "/tmp/tmpz1ss252g/train.py", line 8, in <module>
output = train_fn(*args)
File "<command-2821949673242075>", line 46, in train_model
File "/databricks/python/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/databricks/python/lib/python3.10/site-packages/transformers/trainer.py", line 1855, in _inner_training_loop
self.control = self.callback_handler.on_train_begin(args, self.state, self.control)
File "/databricks/python/lib/python3.10/site-packages/transformers/trainer_callback.py", line 353, in on_train_begin
return self.call_event("on_train_begin", args, state, control)
File "/databricks/python/lib/python3.10/site-packages/transformers/trainer_callback.py", line 397, in call_event
result = getattr(callback, event)(
File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 1021, in on_train_begin
self.setup(args, state, model)
File "/databricks/python/lib/python3.10/site-packages/transformers/integrations.py", line 990, in setup
self._ml_flow.start_run(run_name=args.run_name, nested=self._nested_run)
File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/fluent.py", line 363, in start_run
active_run_obj = client.create_run(
File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/client.py", line 326, in create_run
return self._tracking_client.create_run(experiment_id, start_time, tags, run_name)
File "/databricks/python/lib/python3.10/site-packages/mlflow/tracking/_tracking_service/client.py", line 133, in create_run
return self.store.create_run(
File "/databricks/python/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 178, in create_run
response_proto = self._call_endpoint(CreateRun, req_body)
File "/databricks/python/lib/python3.10/site-packages/mlflow/store/tracking/rest_store.py", line 59, in _call_endpoint
return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
File "/databricks/python/lib/python3.10/site-packages/mlflow/utils/databricks_utils.py", line 422, in get_databricks_host_creds
config = provider.get_config()
File "/databricks/python/lib/python3.10/site-packages/databricks_cli/configure/provider.py", line 134, in get_config
raise InvalidConfigurationError.for_profile(None)
databricks_cli.utils.InvalidConfigurationError: You haven't configured the CLI yet! Please configure by entering `/tmp/tmpz1ss252g/train.py configure`
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2572 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2573 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2574 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2571) of binary: /local_disk0/.ephemeral_nfs/envs/pythonEnv-3b3dff80-496a-4c7d-9684-b04a17a299d3/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
main()
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/databricks/python/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/tmp/tmpz1ss252g/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-08-22_19:31:58
host : 0821-144503-em46c4jc-10-52-237-200
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2571)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

Do I need to enable more traceback to see more of the error? Do I need to 'configure the CLI', whatever that means? Is there something extremely obvious I'm just missing?

I am using a g5.12xlarge with 4 GPUs, and my Databricks runtime version is '13.2 ML (includes Apache Spark 3.4.0, GPU, Scala 2.12)'. I'm running this from within a Databricks notebook.

KYX · ‎04-15-2024

Hi AChang, did you ever resolve the error? I'm running into the same one.

AChang · ‎04-15-2024

Hey @KYX, I don't believe I ever did. You can try configuring the CLI in the ephemeral terminal in the notebook, but that really shouldn't be necessary, so I think something else has to be going on.
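
Going by the traceback above, the failing call is MLflow's `start_run`, which the `transformers` MLflow callback triggers inside each worker process. MLflow then cannot find workspace credentials there, which is what the "configure the CLI" message is really about. One thing that may be worth trying, as a rough sketch rather than a confirmed fix, is to export the credentials as the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables at the top of the training function. The helper name `set_databricks_creds`, the `train_model` signature, and the `dbutils` calls in the comments are assumptions for illustration and are not taken from the notebook:

```python
import os


def set_databricks_creds(host: str, token: str) -> None:
    """Export workspace credentials so MLflow can pick them up in this process."""
    os.environ["DATABRICKS_HOST"] = host
    os.environ["DATABRICKS_TOKEN"] = token


def train_model(db_host: str, db_token: str) -> None:
    # Hypothetical signature: the notebook's real train_model takes other
    # arguments. The point is only that the credentials are set first,
    # inside the worker process, before the Trainer starts an MLflow run.
    set_databricks_creds(db_host, db_token)
    # ... build the Trainer and call trainer.train() as in the notebook ...


# On the driver, in a notebook cell (assumed; requires a Databricks notebook):
#   ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
#   db_host = ctx.apiUrl().get()
#   db_token = ctx.apiToken().get()
#   TorchDistributor(num_processes=4, local_mode=True, use_gpu=True).run(
#       train_model, db_host, db_token
#   )
```

If MLflow logging is not needed during training, passing `report_to="none"` to `TrainingArguments` should skip the MLflow callback altogether, which would sidestep the credential lookup.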