<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: TrainingArguments fails in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153630#M4599</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;The problem is that&amp;nbsp;TrainingArguments triggers a &lt;STRONG&gt;distributed training detection&lt;/STRONG&gt; routine internally, which tries to inspect the environment for things like MPI, OpenMPI, or other distributed frameworks. In Databricks, this probing hangs because the cluster environment has partial distributed computing infrastructure (Spark) that responds to some of those checks but never completes them.&lt;/P&gt;&lt;P&gt;Try setting the following environment variables before importing transformers:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

# Signal a single-process, non-distributed run; must be set
# before transformers is imported
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "-1"

from transformers import TrainingArguments

print("start")

args = TrainingArguments(
    output_dir="test",
    use_cpu=True
)

print("end")&lt;/LI-CODE&gt;</description>
    <pubDate>Tue, 07 Apr 2026 15:27:15 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2026-04-07T15:27:15Z</dc:date>
    <item>
      <title>TrainingArguments fails</title>
      <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153618#M4598</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am working on an ML project for text classification and I have a problem.&lt;/P&gt;&lt;P&gt;The following piece of code stalls completely. It prints 'start' but never 'end'.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from transformers import TrainingArguments
print("start")
args = TrainingArguments(output_dir="test")
print("end")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;I installed the hugging face package with the following cell:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;%pip install  transformers[torch] torch
%restart_python&lt;/LI-CODE&gt;&lt;P&gt;I am trying to run it in runtime 17.3 (without ML) but it also stalls in the serverless environment "base v5" of databricks free which runs in python 3.12.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried running the same code in my laptop with python 3.12 and it does not fail.&lt;/P&gt;&lt;P&gt;Has anyone else had this problem? How did you solve it?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Apr 2026 12:36:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153618#M4598</guid>
      <dc:creator>thomas_berry</dc:creator>
      <dc:date>2026-04-07T12:36:41Z</dc:date>
    </item>
    <item>
      <title>Re: TrainingArguments fails</title>
      <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153630#M4599</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;The problem is that&amp;nbsp;TrainingArguments triggers a &lt;STRONG&gt;distributed training detection&lt;/STRONG&gt; routine internally, which tries to inspect the environment for things like MPI, OpenMPI, or other distributed frameworks. In Databricks, this probing hangs because the cluster environment has partial distributed computing infrastructure (Spark) that responds to some of those checks but never completes them.&lt;/P&gt;&lt;P&gt;Try setting the following environment variables before importing transformers:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

# Signal a single-process, non-distributed run; must be set
# before transformers is imported
os.environ["WORLD_SIZE"] = "1"
os.environ["LOCAL_RANK"] = "-1"

from transformers import TrainingArguments

print("start")

args = TrainingArguments(
    output_dir="test",
    use_cpu=True
)

print("end")&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 07 Apr 2026 15:27:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153630#M4599</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2026-04-07T15:27:15Z</dc:date>
    </item>
    <item>
      <title>Re: TrainingArguments fails</title>
      <link>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153631#M4600</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47963"&gt;@thomas_berry&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is a well-known issue with transformers and torch in environments that use forked processes or multiprocessing under the hood — which is exactly what Databricks executors and serverless compute do.&lt;BR /&gt;Root cause: TrainingArguments triggers PyTorch's distributed training initialization code, which tries to detect available hardware and set up process groups. In Databricks (both classic and serverless), this spawns or probes subprocesses that deadlock because the Spark executor environment intercepts or blocks certain POSIX signals and fork behaviors. Your laptop doesn't have this problem because it's a clean single-process Python environment.&lt;BR /&gt;The fix: Set the following environment variables before importing anything from transformers or torch. The key one is telling PyTorch not to attempt distributed setup:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"
os.environ["RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["TORCHELASTIC_ERROR_FILE"] = "/tmp/torch_error.json"

# Note: TORCH_DISTRIBUTED_DEBUG only controls torch.distributed debug
# logging verbosity; the single-process signal comes from the variables above
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "OFF"
os.environ["OMP_NUM_THREADS"] = "1"&lt;/LI-CODE&gt;&lt;P&gt;Then your import and instantiation:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from transformers import TrainingArguments
print("start")
args = TrainingArguments(output_dir="test", no_cuda=True)  # no_cuda is deprecated; newer transformers use use_cpu=True
print("end")&lt;/LI-CODE&gt;&lt;P&gt;Why no_cuda=True matters here too: Even without a GPU, TrainingArguments will probe CUDA device availability via torch.cuda, which can trigger another hang in Databricks serverless (DBR base v5 / Python 3.12) because the CUDA stub libraries behave differently inside the sandboxed execution environment.&lt;BR /&gt;If you're on serverless specifically, add this as well — it prevents the tokenizers library (a transitive dependency) from spawning its own threads:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;os.environ["TOKENIZERS_PARALLELISM"] = "false"&lt;/LI-CODE&gt;&lt;P&gt;Cleanest pattern for a notebook cell:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import os

os.environ.update({
    "MASTER_ADDR": "localhost",
    "MASTER_PORT": "12355",
    "RANK": "0",
    "WORLD_SIZE": "1",
    "OMP_NUM_THREADS": "1",
    "TOKENIZERS_PARALLELISM": "false",
})

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/model_output",
    no_cuda=True,  # or use_cpu=True on newer transformers versions
)
print("TrainingArguments initialized successfully")&lt;/LI-CODE&gt;</description>
      <pubDate>Tue, 07 Apr 2026 15:29:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/trainingarguments-fails/m-p/153631#M4600</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2026-04-07T15:29:27Z</dc:date>
    </item>
  </channel>
</rss>

