🐞 Stuck on LightGBM Distributed Training in PySpark – Hanging After Socket Communication

amanjethani
New Contributor

🔧 My Setup:

I'm trying to run distributed LightGBM training using synapseml.lightgbm.LightGBMRegressor in PySpark.

💻 Cluster Details:

  • Spark version: 3.5.1 (compatible with PySpark 3.5.6)

  • PySpark version: 3.5.6

  • synapseml: v0.11.1 (latest)

  • Spark Cluster: 3 Hetzner nodes

    • Driver: 5.161.217.134

    • Worker 1: 159.69.6.195

    • Worker 2: 91.99.133.95

  • Ports Open: 30000–45000 TCP on all nodes (very wide range just to get things working)

What Works:

  • Cluster is configured correctly. All Spark jobs and partitions are assigned and shuffled as expected.

  • LightGBM begins training; it launches sockets and receives enabledTask:<ip>:<port>:<partition>:<taskid> messages from all worker nodes.

  • No errors appear in the logs.

The Problem:

The training gets stuck at the point where the driver closes all sockets after receiving topology info. Specifically, logs stop here:

NetworkManager: driver writing back network topology to 2 connections: ...
NetworkManager: driver writing back partition topology to 2 connections: ...
NetworkManager: driver closing all sockets and server socket
NetworkManager: driver done closing all sockets and server socket

After this, nothing happens — no progress, no error, no timeout. Spark UI just shows the job hanging.

🔍 What I’ve Tried:

  • Repartitioned data to match number of workers.

  • Verified that all workers are reachable from driver on the open ports.

  • Set parallelism="data_parallel", also tried tree_learner="data" explicitly.

  • Experimented with broadcast & partition sizes to no avail.

 

My Questions:

  1. Why does training hang even after all workers successfully establish socket communication?

  2. Is this a known issue with certain versions of synapseml or LightGBM?

  3. How can I restrict or fix the port range LightGBM uses? I want to avoid opening a massive 30000–45000 range — can this be pinned reliably?

  4. Any workaround or logs I should enable to debug deeper (e.g., LightGBM internal debug mode)?

  5. Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?


Sample Code

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.appName("Distributed LightGBM").getOrCreate()

# Synthetic dataset: 20 numeric features plus a binary label.
df = spark.range(0, 100000)
for i in range(20):
    df = df.withColumn(f"f{i}", (df["id"] * 0.1 + i) % 1)
df = df.withColumn("label", (df["id"] % 2).cast("double"))

features = [f"f{i}" for i in range(20)]
vec = VectorAssembler(inputCols=features, outputCol="features")
df = vec.transform(df).select("features", "label").repartition(2)  # one partition per worker

lgbm = LightGBMRegressor(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    earlyStoppingRound=10,
    verbosity=1,
    parallelism="data_parallel",
)

model = lgbm.fit(df)

$SPARK_HOME/bin/spark-submit \
  --master spark://5.161.217.134:3342 \
  --conf spark.driver.host=5.161.217.134 \
  --conf spark.driver.port=3346 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.executor.memory=29g \
  --conf spark.executor.cores=16 \
  --conf spark.driver.memory=8g \
  --conf spark.blockManager.port=3347 \
  --conf spark.fileserver.port=3348 \
  --conf spark.ui.port=3379 \
  --conf spark.broadcast.port=3350 \
  --conf spark.task.cpus=1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.4 \
  --packages com.microsoft.azure:synapseml_2.12:0.11.1 \
  spark_lgb.py

I'm using the above command to run the PySpark job.

1 REPLY

stbjelcevic
Databricks Employee

Hi @amanjethani,

Thanks for laying out the setup and symptoms so clearly. The hang most likely occurs either because LightGBM’s distributed network never fully forms between the executors, or because the expected task count doesn’t match the number of tasks that actually start, which deadlocks the job right after the driver closes its coordination sockets. Pinning a single listen port (e.g., 12400) on the executors, matching numTasks to your partitions, and toggling barrier execution mode are the most reliable fixes. Also set a time_out so the job fails fast instead of hanging and you can capture the underlying error in the logs.
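For illustration, here is a minimal sketch of how those knobs could be applied to the estimator from your sample code above. The port numbers, numTasks value, and time_out are placeholders to adapt to your cluster, and the parameter names (defaultListenPort, driverListenPort, numTasks, useBarrierExecutionMode, passThroughArgs) come from the SynapseML LightGBM API, so it is worth confirming them against the 0.11.1 docs you have installed:

from synapse.ml.lightgbm import LightGBMRegressor

# Sketch only: port numbers, task count, and time_out are placeholder values.
lgbm = LightGBMRegressor(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    verbosity=2,                    # more detailed SynapseML logging
    parallelism="data_parallel",
    defaultListenPort=12400,        # pin the executor listen port instead of a wide range
    driverListenPort=12500,         # pin the driver's coordination port
    numTasks=2,                     # match the number of partitions actually used for training
    useBarrierExecutionMode=True,   # toggle this; either setting has resolved hangs in different setups
    passThroughArgs="time_out=5",   # native LightGBM network timeout (minutes), so it fails instead of hanging
)
model = lgbm.fit(df)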

To answer your questions specifically:

 

  • Why does training hang even after all workers successfully establish socket communication?
    Because the LightGBM workers still need to complete peer-to-peer connections across all executors after the driver shares topology. Any failure to connect to a peer (firewall, port not open, wrong IP/interface) or a mismatch in the number of tasks expected vs. actually started will cause the network init to block without error in Spark. (source)

  • Is this a known issue with certain versions of synapseml or LightGBM?
    Yes—there are reports of indefinite hangs on SynapseML 0.11.1 in distributed fits (non-Databricks) and prior versions where barrier mode and networking interplay caused hangs; turning off barrier mode has helped in some cases. (source)

  • How can I restrict or fix the port range LightGBM uses?
    You can pin the ports with SynapseML parameters defaultListenPort (executors) and driverListenPort (driver), and/or pass LightGBM’s native local_listen_port via passThroughArgs. The default listen port used by LightGBM is 12400, and you can reliably pin to a single port instead of a wide range. (source)

  • Any workaround or logs I should enable to debug deeper?
    Increase SynapseML verbosity to 2 for more detailed logs, enable barrier mode selectively, and set LightGBM’s time_out (minutes) to cause a timeout instead of an infinite wait. Use Spark’s logs and explicit connectivity tests between workers on the pinned port to validate end-to-end reachability; a minimal probe sketch follows after this list.

  • Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?
    Yes—SynapseML explains that deadlocks can occur during initialization if the driver’s expected task count differs from actual, and barrier mode is available as a mitigation (with caveats). Toggling it can resolve hangs depending on the cluster.
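To make the connectivity test mentioned above concrete, here is a rough probe you could run as a plain Spark job. It is not part of SynapseML; the host list and port below are simply your three nodes and the port you intend to pin. A "connection refused" result means the packet reached the host (the firewall is not the problem), while a timeout usually means traffic on that port is being dropped:

import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lightgbm-port-probe").getOrCreate()

HOSTS = ["5.161.217.134", "159.69.6.195", "91.99.133.95"]  # driver + workers
PORT = 12400  # the port you plan to pin via defaultListenPort

def probe(_):
    # Runs on an executor: try to open a TCP connection to every node on PORT.
    me = socket.gethostname()
    out = []
    for host in HOSTS:
        try:
            with socket.create_connection((host, PORT), timeout=5):
                out.append((me, host, "connected"))
        except ConnectionRefusedError:
            out.append((me, host, "refused (port reachable, nothing listening yet)"))
        except OSError as exc:
            out.append((me, host, f"blocked or unreachable: {exc}"))
    return out

# Two elements in two partitions so the probe runs on more than one executor.
for row in spark.sparkContext.parallelize(range(2), 2).flatMap(probe).collect():
    print(row)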

 
