
🐞 Stuck on LightGBM Distributed Training in PySpark – Hanging After Socket Communication

amanjethani
New Contributor

🔧 My Setup:

I'm trying to run distributed LightGBM training using synapse.ml.lightgbm.LightGBMRegressor in PySpark.

💻 Cluster Details:

  • Spark version: 3.5.1 (compatible with PySpark 3.5.6)

  • PySpark version: 3.5.6

  • synapseml: v0.11.1 (latest)

  • Spark Cluster: 3 Hetzner nodes

    • Driver: 5.161.217.134

    • Worker 1: 159.69.6.195

    • Worker 2: 91.99.133.95

  • Ports Open: 30000–45000 TCP on all nodes (very wide range just to get things working)

What Works:

  • Cluster is configured correctly. All Spark jobs and partitions are assigned and shuffled as expected.

  • LightGBM begins training; it launches sockets and receives enabledTask:<ip>:<port>:<partition>:<taskid> messages from all worker nodes.

  • No errors appear in the logs.
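For reference, the handshake messages quoted above are colon-delimited, so a small parser makes it easy to tally which workers have checked in. The field names below are my own reading of the enabledTask:&lt;ip&gt;:&lt;port&gt;:&lt;partition&gt;:&lt;taskid&gt; line, not SynapseML's internal terms:

```python
# Illustrative parser for the status lines LightGBM tasks send to the
# driver during setup. Field order follows the log line quoted above;
# the names are descriptive guesses, not SynapseML's own terminology.

def parse_task_status(line: str) -> dict:
    status, ip, port, partition, task_id = line.strip().split(":")
    return {
        "status": status,        # e.g. "enabledTask"
        "ip": ip,
        "port": int(port),
        "partition": int(partition),
        "task_id": int(task_id),
    }

# Example messages shaped like the ones in my driver logs
msgs = [
    "enabledTask:159.69.6.195:30001:0:1",
    "enabledTask:91.99.133.95:30002:1:2",
]
parsed = [parse_task_status(m) for m in msgs]
workers_seen = {p["ip"] for p in parsed}
```

Tallying `workers_seen` against the expected worker IPs is how I confirmed both workers had reported in before the hang.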

The Problem:

The training gets stuck at the point where the driver closes all sockets after receiving topology info. Specifically, logs stop here:

NetworkManager: driver writing back network topology to 2 connections: ...
NetworkManager: driver writing back partition topology to 2 connections: ...
NetworkManager: driver closing all sockets and server socket
NetworkManager: driver done closing all sockets and server socket

After this, nothing happens — no progress, no error, no timeout. Spark UI just shows the job hanging.

🔍 What I've Tried:

  • Repartitioned data to match number of workers.

  • Verified that all workers are reachable from driver on the open ports.

  • Set parallelism="data_parallel", and also tried tree_learner="data" explicitly.

  • Experimented with broadcast & partition sizes to no avail.
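For the reachability check in the second bullet, I used a quick TCP probe along these lines (the hosts and ports are placeholders; the example below probes a local listener so it is self-contained):

```python
# Quick TCP reachability probe, used to confirm each node can reach the
# others on the open port range. Host/port values here are placeholders.
import socket

def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a local listener instead of the real cluster nodes:
server = socket.socket()
server.bind(("127.0.0.1", 0))     # let the OS pick a free port
server.listen(1)
_, test_port = server.getsockname()

reachable = is_port_open("127.0.0.1", test_port)
server.close()
```

Running the probe from each worker against the driver (and vice versa) on a few ports in the 30000–45000 range all came back True.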


My Questions:

  1. Why does training hang even after all workers successfully establish socket communication?

  2. Is this a known issue with certain versions of synapseml or LightGBM?

  3. How can I restrict or fix the port range LightGBM uses? I want to avoid opening a massive 30000–45000 range — can this be pinned reliably?

  4. Any workaround or logs I should enable to debug deeper (e.g., LightGBM internal debug mode)?

  5. Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?
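On question 3, my understanding (please correct me if wrong) is that SynapseML exposes a defaultListenPort parameter on the estimator (default 12400), and I am assuming each task on a host binds an offset above that base, so the firewall range could be much narrower than 30000–45000. A small helper for computing the range to open, under that assumption:

```python
# Helper for computing the firewall range to open per node, ASSUMING each
# LightGBM task binds defaultListenPort + <task index on that host>.
# That offset behavior is my assumption, not something I have confirmed
# in the SynapseML source or docs.

def ports_to_open(base_port: int, max_tasks_per_node: int) -> range:
    """Ports a node needs open if tasks bind base_port, base_port+1, ..."""
    return range(base_port, base_port + max_tasks_per_node)

# With the documented default base port of 12400 and at most 16 tasks
# per node (spark.executor.cores=16 in my submit command):
needed = ports_to_open(12400, 16)
```

If that assumption holds, opening 12400–12415 per worker would replace my wide range, though I would still want to verify against the NetworkManager log lines. On question 5, I have also seen a useBarrierExecutionMode parameter on the estimator; if the hang is scheduling-related, is that worth toggling?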


Sample Code

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.appName("Distributed LightGBM").getOrCreate()

# Synthetic dataset: 100k rows, 20 numeric features, binary label
df = spark.range(0, 100000)
for i in range(20):
    df = df.withColumn(f"f{i}", (df["id"] * 0.1 + i) % 1)
df = df.withColumn("label", (df["id"] % 2).cast("double"))

features = [f"f{i}" for i in range(20)]
vec = VectorAssembler(inputCols=features, outputCol="features")
# One partition per worker node
df = vec.transform(df).select("features", "label").repartition(2)

lgbm = LightGBMRegressor(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    earlyStoppingRound=10,
    verbosity=1,
    parallelism="data_parallel",
)

model = lgbm.fit(df)

$SPARK_HOME/bin/spark-submit \
  --master spark://5.161.217.134:3342 \
  --conf spark.driver.host=5.161.217.134 \
  --conf spark.driver.port=3346 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.executor.memory=29g \
  --conf spark.executor.cores=16 \
  --conf spark.driver.memory=8g \
  --conf spark.blockManager.port=3347 \
  --conf spark.fileserver.port=3348 \
  --conf spark.ui.port=3379 \
  --conf spark.broadcast.port=3350 \
  --conf spark.task.cpus=1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.4 \
  --packages com.microsoft.azure:synapseml_2.12:0.11.1 \
  spark_lgb.py

I'm using the spark-submit command above to run the PySpark job.
