🔧My Setup:
I'm trying to run distributed LightGBM training using synapse.ml.lightgbm.LightGBMRegressor in PySpark.
💻Cluster Details:
Spark version: 3.5.1 (compatible with PySpark 3.5.6)
PySpark version: 3.5.6
synapseml: v0.11.1 (latest)
Spark Cluster: 3 Hetzner nodes
Driver: 5.161.217.134
Worker 1: 159.69.6.195
Worker 2: 91.99.133.95
Ports Open: TCP 30000–45000 on all nodes (a very wide range, just to get things working)
✅What Works:
Cluster is configured correctly. All Spark jobs and partitions are assigned and shuffled as expected.
LightGBM begins training; sockets are opened and the driver receives enabledTask:<ip>:<port>:<partition>:<taskid> messages from all worker nodes.
No errors appear in the logs.
❌The Problem:
Training gets stuck at the point where the driver closes all sockets after writing the network and partition topology back to the workers. Specifically, the logs stop here:
NetworkManager: driver writing back network topology to 2 connections: ...
NetworkManager: driver writing back partition topology to 2 connections: ...
NetworkManager: driver closing all sockets and server socket
NetworkManager: driver done closing all sockets and server socket
After this, nothing happens — no progress, no error, no timeout. Spark UI just shows the job hanging.
🔍What I’ve Tried:
Repartitioned data to match number of workers.
Verified that all workers are reachable from driver on the open ports.
Set parallelism="data_parallel"; also tried tree_learner="data" explicitly (see the sketch after this list).
Experimented with broadcast and partition sizes, to no avail.
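For reference, the tree_learner variant looked roughly like the sketch below. I am assuming passThroughArgs is the right way to forward the raw LightGBM option; everything else mirrors the sample code further down.

from synapse.ml.lightgbm import LightGBMRegressor

# Variant of the regressor from the sample code below. passThroughArgs is my
# assumption about how to forward the native LightGBM option tree_learner=data;
# parallelism="data_parallel" is the SynapseML-level setting I also tried.
lgbm_variant = LightGBMRegressor(
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    parallelism="data_parallel",
    passThroughArgs="tree_learner=data",
)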
❓My Questions:
Why does training hang even after all workers successfully establish socket communication?
Is this a known issue with certain versions of synapseml or LightGBM?
How can I restrict or pin the port range LightGBM uses? I want to avoid opening a massive 30000–45000 range; can the port be fixed reliably? (The sketch after this list shows what I was planning to try.)
Are there any workarounds, or logs I should enable to debug deeper (e.g., LightGBM's internal debug mode)?
Is it possible that a missing barrier or stage finalization in Spark is causing this silent hang?
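For the port, logging, and barrier questions above, these are the knobs I found in the SynapseML docs and was planning to try; I have not verified that they are the right ones or that they actually address the hang, so treat the names and values below as assumptions.

from synapse.ml.lightgbm import LightGBMRegressor

# Candidate settings; parameter names are taken from the SynapseML LightGBM docs
# and the values are placeholders, not a confirmed fix.
lgbm_debug = LightGBMRegressor(
    featuresCol="features",
    labelCol="label",
    defaultListenPort=12400,        # pin the first LightGBM port instead of opening 30000-45000?
    useBarrierExecutionMode=True,   # related to the barrier question above
    timeout=600.0,                  # fail with an error instead of hanging silently?
)

# For deeper logs, raise the driver log level ("spark" is the SparkSession from the sample code below):
spark.sparkContext.setLogLevel("DEBUG")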
Sample Code
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMRegressor

spark = SparkSession.builder.appName("Distributed LightGBM").getOrCreate()

# Synthetic dataset: 20 numeric features plus a 0/1 label
df = spark.range(0, 100000)
for i in range(20):
    df = df.withColumn(f"f{i}", (df["id"] * 0.1 + i) % 1)
df = df.withColumn("label", (df["id"] % 2).cast("double"))

features = [f"f{i}" for i in range(20)]
vec = VectorAssembler(inputCols=features, outputCol="features")
df = vec.transform(df).select("features", "label").repartition(2)  # one partition per worker

lgbm = LightGBMRegressor(
    objective="binary",
    featuresCol="features",
    labelCol="label",
    numIterations=100,
    learningRate=0.1,
    numLeaves=31,
    earlyStoppingRound=10,
    verbosity=1,
    parallelism="data_parallel",
)
model = lgbm.fit(df)  # hangs here
$SPARK_HOME/bin/spark-submit \
  --master spark://5.161.217.134:3342 \
  --conf spark.driver.host=5.161.217.134 \
  --conf spark.driver.port=3346 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.executor.memory=29g \
  --conf spark.executor.cores=16 \
  --conf spark.driver.memory=8g \
  --conf spark.blockManager.port=3347 \
  --conf spark.fileserver.port=3348 \
  --conf spark.ui.port=3379 \
  --conf spark.broadcast.port=3350 \
  --conf spark.task.cpus=1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryoserializer.buffer.max=1024m \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.execution.arrow.pyspark.enabled=true \
  --conf spark.memory.fraction=0.9 \
  --conf spark.memory.storageFraction=0.4 \
  --packages com.microsoft.azure:synapseml_2.12:0.11.1 \
  spark_lgb.py
I use the above command to submit the PySpark job.