<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: TrainingArguments fails in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153630#M4599</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;The problem is that&amp;nbsp;TrainingArguments triggers a &lt;STRONG&gt;distributed training detection&lt;/STRONG&gt; routine internally, which tries to inspect the environment for things like MPI, OpenMPI, or other distributed frameworks. In Databricks, this probing hangs because the cluster environment has partial distributed computing infrastructure (Spark) that responds to some of those checks but never completes them.&lt;/P&gt;&lt;P&gt;Try setting the following environment variables before importing transformers:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

# Signal a single-process, non-distributed run; must be set
# before transformers is imported
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "-1"

from transformers import TrainingArguments

print("start")

args = TrainingArguments(
    output_dir="test",
    use_cpu=True
)

print("end")&lt;/LI-CODE&gt;</description>
    <pubDate>Tue, 07 Apr 2026 15:27:15 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2026-04-07T15:27:15Z</dc:date>
    <item>
      <title>TrainingArguments fails</title>
      <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153618#M4598</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am working on an ML project for text classification and I have a problem.&lt;/P&gt;&lt;P&gt;The following piece of code stalls completely. It prints 'start' but never 'end'.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from transformers import TrainingArguments
print("start")
args = TrainingArguments(output_dir="test")
print("end")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;I installed the hugging face package with the following cell:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;%pip install  transformers[torch] torch
%restart_python&lt;/LI-CODE&gt;&lt;P&gt;I am trying to run it in runtime 17.3 (without ML) but it also stalls in the serverless environment "base v5" of databricks free which runs in python 3.12.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried running the same code in my laptop with python 3.12 and it does not fail.&lt;/P&gt;&lt;P&gt;Has anyone else had this problem? How did you solve it?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Apr 2026 12:36:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153618#M4598</guid>
      <dc:creator>thomas_berry</dc:creator>
      <dc:date>2026-04-07T12:36:41Z</dc:date>
    </item>
    <item>
      <title>Re: TrainingArguments fails</title>
      <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153630#M4599</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;The problem is that&amp;nbsp;TrainingArguments triggers a &lt;STRONG&gt;distributed training detection&lt;/STRONG&gt; routine internally, which tries to inspect the environment for things like MPI, OpenMPI, or other distributed frameworks. In Databricks, this probing hangs because the cluster environment has partial distributed computing infrastructure (Spark) that responds to some of those checks but never completes them.&lt;/P&gt;&lt;P&gt;Try setting the following environment variables before importing transformers:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

# Signal a single-process, non-distributed run; must be set
# before transformers is imported
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "-1"

from transformers import TrainingArguments

print("start")

args = TrainingArguments(
    output_dir="test",
    use_cpu=True
)

print("end")&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 07 Apr 2026 15:27:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153630#M4599</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-04-07T15:27:15Z</dc:date>
    </item>
    <item>
      <title>Re: TrainingArguments fails</title>
      <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153631#M4600</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47963"&gt;@thomas_berry&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is a well-known issue with transformers and torch in environments that use forked processes or multiprocessing under the hood — which is exactly what Databricks executors and serverless compute do.&lt;BR /&gt;Root cause: TrainingArguments triggers PyTorch's distributed training initialization code, which tries to detect available hardware and set up process groups. In Databricks (both classic and serverless), this spawns or probes subprocesses that deadlock because the Spark executor environment intercepts or blocks certain POSIX signals and fork behaviors. Your laptop doesn't have this problem because it's a clean single-process Python environment.&lt;BR /&gt;The fix: Set the following environment variables before importing anything from transformers or torch. The key one is telling PyTorch not to attempt distributed setup:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["TORCHELASTIC_ERROR_FILE"] = "/tmp/torch_error.json"

# Note: TORCH_DISTRIBUTED_DEBUG only controls torch.distributed debug
# logging verbosity; the single-process signal comes from the variables above
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "OFF"
os.environ["OMP_NUM_THREADS"] = "1"&lt;/LI-CODE&gt;&lt;P&gt;Then your import and instantiation:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from transformers import TrainingArguments
print("start")
args = TrainingArguments(output_dir="test", no_cuda=True)  # no_cuda is deprecated; newer transformers use use_cpu=True
print("end")&lt;/LI-CODE&gt;&lt;P&gt;Why no_cuda=True matters here too: Even without a GPU, TrainingArguments will probe CUDA device availability via torch.cuda, which can trigger another hang in Databricks serverless (DBR base v5 / Python 3.12) because the CUDA stub libraries behave differently inside the sandboxed execution environment.&lt;BR /&gt;If you're on serverless specifically, add this as well — it prevents the tokenizers library (a transitive dependency) from spawning its own threads:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;os.environ["TOKENIZERS_PARALLELISM"] = "false"&lt;/LI-CODE&gt;&lt;P&gt;Cleanest pattern for a notebook cell:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

os.environ.update({
    "MASTER_ADDR": "localhost",
    "MASTER_PORT": "12355",
    "RANK": "0",
    "WORLD_SIZE": "1",
    "OMP_NUM_THREADS": "1",
    "TOKENIZERS_PARALLELISM": "false",
})

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/model_output",
    no_cuda=True,  # or use_cpu=True on newer transformers versions
)
print("TrainingArguments initialized successfully")&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 07 Apr 2026 15:29:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153631#M4600</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2026-04-07T15:29:27Z</dc:date>
    </item>
  </channel>
</rss>

