
filedescriptor out of range in select()

anirbanmishra — Mon, 02 Jun 2025 02:52:51 GMT

Hi All

I am running a training job using MLflow and a Databricks recipe. In the recipe.train step, training starts an experiment and runs for 350 epochs. After the 350 epochs complete and I try to log the artifacts, the process gets stuck for a long time and I keep getting this error repeatedly:

ValueError: filedescriptor out of range in select()

Sun Jun 1 02:38:07 2025 Connection to spark from PID 4942

Sun Jun 1 02:38:08 2025 Initialized gateway on port 44549

ERROR:py4j.java_gateway:Error while waiting for a connection.

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
    readable, writable, errored = select.select(
                                  ^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

Sun Jun 1 02:38:08 2025 Connection to spark from PID 4942

Sun Jun 1 02:38:08 2025 Initialized gateway on port 35473

ERROR:py4j.java_gateway:Error while waiting for a connection.

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
    readable, writable, errored = select.select(
                                  ^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()

During this time the CPU usage reaches almost 100%, and after an hour or so the recipe.train() step fails with
Fatal error: The Python kernel is unresponsive. Throughout the training step itself, CPU and GPU usage mostly stay below 40%.

I am also using the databricks recipe to log the regular artifacts as part of the experiment.

Has anyone faced the above issue? Please let me know if any logs would help in identifying the real problem.

Re: filedescriptor out of range in select()

lingareddy_Alva — Wed, 04 Jun 2025 01:13:22 GMT

Hi @anirbanmishra

This is a common issue with MLflow on Databricks, particularly when dealing with large experiments or numerous artifacts.
The "filedescriptor out of range in select()" error typically occurs due to resource exhaustion or connection pool issues with
the Py4J gateway that bridges Python and Spark/JVM.
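As a hedged illustration of where this message comes from: Python's select.select() only accepts descriptor numbers below a fixed ceiling (FD_SETSIZE, typically 1024 on Linux), so the error usually means the process is holding many descriptors open at once. Because that ceiling sits inside select() itself, raising the ulimit alone may not make the error go away. The sketch below assumes a Linux driver with /proc available; the helper names highest_open_fd and select_accepts are hypothetical, not part of Py4J or MLflow.

```python
import os
import select


def highest_open_fd():
    """Return the largest descriptor number open in this process (Linux /proc only)."""
    highest = -1
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            os.fstat(fd)
        except OSError:
            # The descriptor used for the directory listing is already closed.
            continue
        highest = max(highest, fd)
    return highest


def select_accepts(fd):
    """Return True if select.select() is willing to handle this descriptor number."""
    try:
        select.select([fd], [], [], 0)
        return True
    except ValueError:
        # Raised for numbers at or above the select() ceiling.
        return False
    except OSError:
        # The number is within range but is not an open descriptor.
        return True


if __name__ == "__main__":
    top = highest_open_fd()
    print(f"highest open descriptor: {top}")
    print(f"select() can handle it: {select_accepts(top)}")
```

Running this in the notebook right after training and before artifact logging would show whether the descriptor numbers have already climbed past what select() can handle.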

The most effective immediate solution is usually to reduce the frequency of artifact logging and increase the file descriptor limits.
If the issue persists, try separating the training and logging phases entirely.

Reduce Artifact Logging Frequency
Instead of logging artifacts at every epoch, log them at intervals:

# Log artifacts every 10 epochs instead of every epoch
if epoch % 10 == 0:
    mlflow.log_artifact(artifact_path)

Re: filedescriptor out of range in select()

anirbanmishra — Wed, 04 Jun 2025 13:17:53 GMT

Thank you for replying. I don't log any artifacts during the epochs; I only log metrics every epoch. I log all artifacts at the end of training, which is why I see the experiment finish successfully and then these errors appear. How can I avoid the error in this case? I need most of the artifacts for the next steps and downstream analysis.

Re: filedescriptor out of range in select()

lingareddy_Alva — Wed, 04 Jun 2025 16:29:11 GMT

Hi @anirbanmishra

I see - you're logging metrics during training (which works fine) but encountering the file descriptor/Py4J gateway errors specifically when logging artifacts at the end after 350 epochs. This is actually a more complex issue because by that point, you likely have resource exhaustion from the long-running training process. Here are targeted solutions for your specific scenario:

Solution 1: Reset Spark Context Before Artifact Logging

import time

import mlflow
from pyspark.sql import SparkSession

# After training completes but before artifact logging.
# Caution: stopping the session may affect every notebook or job attached to
# the same cluster, so try this only where restarting Spark is acceptable.
def reset_spark_context():
    try:
        # Get the current Spark session, if there is one
        current_spark = SparkSession.getActiveSession()
        if current_spark:
            current_spark.stop()

        # Wait a moment for cleanup
        time.sleep(5)

        # Create a fresh Spark session
        new_spark = (
            SparkSession.builder
            .appName("ArtifactLogging")
            .config("spark.sql.adaptive.enabled", "true")
            .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
            .getOrCreate()
        )

        return new_spark
    except Exception as e:
        print(f"Error resetting Spark context: {e}")
        return None

# Your training workflow
with mlflow.start_run() as run:
    # Training loop with metric logging
    for epoch in range(350):
        # ... training code ...
        # loss and acc are produced by the training code above
        mlflow.log_metrics({"loss": loss, "accuracy": acc}, step=epoch)

    print("Training completed, preparing to log artifacts...")

    # Reset Spark context before artifact logging
    spark = reset_spark_context()

    # Now log artifacts (still inside the run so they attach to it)
    print("Logging artifacts...")
    mlflow.log_artifacts("path/to/artifacts", "artifacts")

Solution 2: Chunked Artifact Logging with Delays

import os
import time

import mlflow

def log_artifacts_in_chunks(artifact_dir, chunk_size=5, delay_seconds=2):
    """Log artifacts in small chunks with delays to prevent resource exhaustion"""

    artifact_files = []
    for root, _dirs, files in os.walk(artifact_dir):
        for name in files:
            artifact_files.append(os.path.join(root, name))

    print(f"Total artifacts to log: {len(artifact_files)}")
    total_chunks = (len(artifact_files) + chunk_size - 1) // chunk_size

    # Process in chunks
    for i in range(0, len(artifact_files), chunk_size):
        chunk = artifact_files[i:i + chunk_size]

        print(f"Logging chunk {i // chunk_size + 1}/{total_chunks}")

        for local_path in chunk:
            # Keep the subdirectory layout; without artifact_path every file
            # lands in the run's artifact root, where same-named files from
            # different folders would overwrite each other.
            rel_dir = os.path.relpath(os.path.dirname(local_path), artifact_dir)
            dest = None if rel_dir == "." else rel_dir.replace(os.sep, "/")
            try:
                mlflow.log_artifact(local_path, artifact_path=dest)
                print(f"Logged: {os.path.relpath(local_path, artifact_dir)}")
            except Exception as e:
                print(f"Failed to log {local_path}: {e}")

        # Delay between chunks to let resources recover
        time.sleep(delay_seconds)

# Usage after training
with mlflow.start_run() as run:
    # ... training code ...

    print("Training completed, logging artifacts in chunks...")
    log_artifacts_in_chunks("path/to/artifacts", chunk_size=3, delay_seconds=3)

This approach should resolve the file descriptor and Py4J gateway issues you're experiencing while ensuring all your artifacts are properly logged for downstream analysis.
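To check whether either workaround actually reduces descriptor usage, it may help to see what kinds of descriptors the driver process is holding before and after artifact logging. The following is a minimal sketch, assuming a Linux driver where /proc/self/fd is readable; summarize_open_fds is a hypothetical helper name, not an MLflow or Databricks API.

```python
import os
from collections import Counter


def summarize_open_fds(top=5):
    """Count this process's open descriptors by kind (Linux /proc only)."""
    kinds = Counter()
    for name in os.listdir("/proc/self/fd"):
        try:
            target = os.readlink(f"/proc/self/fd/{name}")
        except OSError:
            # Descriptor was closed between listing and reading the link.
            continue
        # Link targets look like "socket:[12345]", "pipe:[678]",
        # "anon_inode:[eventpoll]", or a plain file path.
        if target.startswith(("socket:", "pipe:", "anon_inode:")):
            kind = target.split(":", 1)[0]
        else:
            kind = "file"
        kinds[kind] += 1
    return kinds.most_common(top)


if __name__ == "__main__":
    for kind, count in summarize_open_fds():
        print(f"{kind}: {count}")
```

If the count of one kind (for example sockets) keeps growing across epochs, that points at the component that is leaking descriptors, which is more useful to know than the total alone.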