Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

TrainingArguments fails

thomas_berry
Databricks Partner

Hello,

I am working on an ML project for text classification and have run into a problem.

The following piece of code stalls completely: it prints 'start' but never 'end'.

from transformers import TrainingArguments
print("start")
args = TrainingArguments(output_dir="test")
print("end")

I installed the Hugging Face package with the following cell:

%pip install transformers[torch] torch
%restart_python

I am trying to run it on Databricks Runtime 17.3 (without ML), but it also stalls in the serverless environment "base v5" of Databricks Free, which runs Python 3.12.

I tried running the same code on my laptop with Python 3.12 and it does not fail there.

Has anyone else had this problem? How did you solve it?
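For anyone trying to narrow down where the call stalls, here is a minimal diagnostic sketch using only the standard library's faulthandler; the 30-second timeout is an arbitrary choice, not anything Databricks-specific:

```python
import faulthandler

# If the code that follows is still running after 30 seconds, dump every
# thread's stack trace to stderr without killing the process.
faulthandler.dump_traceback_later(30, exit=False)

# Run the suspected-hanging code here, e.g.:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="test")

# Reached only if the call returns; cancel the pending stack dump.
faulthandler.cancel_dump_traceback_later()
```

The stack dump usually points at the exact library call the process is blocked in.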

 

2 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi,

The problem is that TrainingArguments triggers a distributed-training detection routine internally, which inspects the environment for things like MPI, OpenMPI, or other distributed frameworks. On Databricks, this probing hangs because the cluster environment has partial distributed-computing infrastructure (Spark) that responds to some of those checks but never completes them.

Try setting the following environment variables:

import os
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "-1"

from transformers import TrainingArguments

print("start")

args = TrainingArguments(
    output_dir="test",
    use_cpu=True,
)

print("end")

lingareddy_Alva
Esteemed Contributor

Hi @thomas_berry 

This is a well-known issue with transformers and torch in environments that use forked processes or multiprocessing under the hood, which is exactly what Databricks executors and serverless compute do.
Root cause: TrainingArguments triggers PyTorch's distributed training initialization code, which tries to detect available hardware and set up process groups. In Databricks (both classic and serverless), this spawns or probes subprocesses that deadlock because the Spark executor environment intercepts or blocks certain POSIX signals and fork behaviors. Your laptop doesn't have this problem because it's a clean single-process Python environment.
The fix: Set the following environment variables before importing anything from transformers or torch. The key one is telling PyTorch not to attempt distributed setup:

import os
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["TORCHELASTIC_ERROR_FILE"] = "/tmp/torch_error.json"

# This is the critical one: it disables the torch.distributed init probe
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "OFF"
os.environ["OMP_NUM_THREADS"] = "1"

Then your import and instantiation:

from transformers import TrainingArguments
print("start")
args = TrainingArguments(output_dir="test", no_cuda=True)
print("end")

Why no_cuda=True matters here too: Even without a GPU, TrainingArguments will probe CUDA device availability via torch.cuda, which can trigger another hang in Databricks serverless (DBR base v5 / Python 3.12) because the CUDA stub libraries behave differently inside the sandboxed execution environment.
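If you want to rule the CUDA probe out entirely regardless of TrainingArguments flags, you can also hide the devices at the CUDA-runtime level before torch is ever imported. A minimal sketch (CUDA_VISIBLE_DEVICES is a standard CUDA runtime variable, nothing Databricks-specific):

```python
import os

# Hide all CUDA devices from the CUDA runtime. Set this BEFORE torch is
# imported: torch.cuda.is_available() will then report False and the
# device probe has nothing to inspect.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import torch
# assert not torch.cuda.is_available()
```

This has to run in the first cell that touches torch; once torch has initialized CUDA, changing the variable has no effect.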
If you're on serverless specifically, add this as well; it prevents the tokenizers library (a transitive dependency) from spawning its own threads:

os.environ["TOKENIZERS_PARALLELISM"] = "false"

Cleanest pattern for a notebook cell:

import os

os.environ.update({
    "MASTER_ADDR": "localhost",
    "MASTER_PORT": "12355",
    "RANK": "0",
    "WORLD_SIZE": "1",
    "OMP_NUM_THREADS": "1",
    "TOKENIZERS_PARALLELISM": "false",
})

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/model_output",
    no_cuda=True,
)
print("TrainingArguments initialized successfully")

 

LR