Hi all,
I am running a training job using MLflow and a Databricks recipe. In the recipe.train step, the training starts an experiment and runs for 350 epochs. After the 350 epochs complete and the step tries to log the artifacts, the process gets stuck for a long time and I keep getting this error repeatedly:
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:07 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 44549
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
    readable, writable, errored = select.select(
                                  ^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:08 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 35473
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
    readable, writable, errored = select.select(
                                  ^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
During this time the CPU usage reaches almost 100%, and after an hour or so the recipe.train() step fails with
Fatal error: The Python kernel is unresponsive.
Throughout the training step itself, by contrast, CPU and GPU usage mostly stay below 40%.
I am also using the Databricks recipe to log the regular artifacts as part of the experiment.
Has anyone faced this issue? Please let me know if any additional logs would help in identifying the real problem.
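
In case it helps with diagnosis: as far as I understand, Python's select.select() raises this ValueError when it is handed a file descriptor number at or above FD_SETSIZE, which is commonly 1024 on Linux. That would point to the driver process holding a large number of open descriptors by the time artifact logging starts, though I have not confirmed that this is the cause here. Below is a minimal sketch I can run in the notebook before and after training to check. It assumes a Linux driver with /proc available; the names open_fd_numbers, fd_report and ASSUMED_FD_SETSIZE are my own and not part of MLflow, py4j or Databricks.

```python
import os
import resource

# Assumption: 1024 is the usual FD_SETSIZE on Linux builds of Python.
# select() cannot handle descriptor numbers at or above this value.
ASSUMED_FD_SETSIZE = 1024


def open_fd_numbers(fd_dir="/proc/self/fd"):
    """Return the sorted descriptor numbers open in this process (Linux only).

    Listing the directory briefly opens one extra descriptor, which is
    included in the result.
    """
    return sorted(int(name) for name in os.listdir(fd_dir) if name.isdigit())


def fd_report():
    """Summarise descriptor usage for the current process."""
    fds = open_fd_numbers()
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "open_count": len(fds),
        "highest_fd": fds[-1],
        "soft_limit": soft_limit,
        # True if some descriptor is already too large for select().
        "select_at_risk": fds[-1] >= ASSUMED_FD_SETSIZE,
    }


print(fd_report())
```

If open_count keeps growing across epochs, or highest_fd is already above 1023 when training ends, that would suggest descriptors (files, sockets, or gateway connections) are being opened during training and never closed.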