06-01-2025 07:49 PM - edited 06-01-2025 07:52 PM
Hi All
I am running a training job using MLflow and Databricks Recipes. In the recipe.train step, the training starts an experiment and runs for 350 epochs. After the 350 epochs complete and I try to log the artifacts, the process gets stuck for a long time and I keep getting this error multiple times:
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:07 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 44549
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
readable, writable, errored = select.select(
^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
Sun Jun 1 02:38:08 2025 Connection to spark from PID 4942
Sun Jun 1 02:38:08 2025 Initialized gateway on port 35473
ERROR:py4j.java_gateway:Error while waiting for a connection.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run
readable, writable, errored = select.select(
^^^^^^^^^^^^^^
ValueError: filedescriptor out of range in select()
During this time the CPU usage reaches almost 100%, and after an hour or so the recipe.train() step fails with:
Fatal error: The Python kernel is unresponsive.
Throughout the training step itself, CPU and GPU usage mostly stay below 40%.
I am also using the Databricks recipe to log the regular artifacts as part of the experiment.
Has anyone faced this issue? Please let me know if any log would help in identifying the real problem.
06-03-2025 06:13 PM
This is a common issue with MLflow on Databricks, particularly when dealing with large experiments or numerous artifacts.
The "filedescriptor out of range in select()" error typically occurs due to resource exhaustion or connection pool issues with
the Py4J gateway that bridges Python and Spark/JVM.
The most effective immediate solution is usually to reduce the frequency of artifact logging and increase the file descriptor limits.
If the issue persists, try separating the training and logging phases entirely.
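Before changing anything, it can help to confirm that descriptor exhaustion is actually what you are hitting. Here is a minimal diagnostic sketch using the standard resource module, run in the same notebook/process right before artifact logging. Note that select() only works with descriptor numbers below 1024, so keeping the count of open descriptors low matters more than raising the ceiling:

import os
import resource

# Current soft/hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limits: soft={soft}, hard={hard}")

# How many descriptors are open right now (Linux-specific /proc lookup)
print(f"open fds: {len(os.listdir('/proc/self/fd'))}")

# Optionally raise the soft limit toward the hard limit
# (whether this is permitted depends on the cluster configuration)
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(hard, 65536), hard))
except (ValueError, OSError) as e:
    print(f"could not raise limit: {e}")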
Reduce Artifact Logging Frequency
Instead of logging artifacts at every epoch, log them at intervals:
# Log artifacts every 10 epochs instead of every epoch
if epoch % 10 == 0:
    mlflow.log_artifact(artifact_path)
06-04-2025 06:17 AM
Thank you for replying. I don't log any artifacts during the epochs; I only log metrics every epoch. I try to log all artifacts at the end of training, which is why I see the experiment finishing successfully and then these errors happening. How can I avoid the error in this case? I need most of the artifacts for the next steps and downstream analysis.
06-04-2025 09:29 AM
I see - you're logging metrics during training (which works fine) but hitting the file descriptor/Py4J gateway errors specifically when logging artifacts at the end, after 350 epochs. This is a trickier situation because by that point the long-running training process has likely accumulated enough open file descriptors and connections to exhaust the available resources. Here are targeted solutions for your specific scenario:
Solution 1: Reset Spark Context Before Artifact Logging
import mlflow
from pyspark.sql import SparkSession
import time

# After training completes but before artifact logging
def reset_spark_context():
    try:
        # Get the current Spark session, if any
        current_spark = SparkSession.getActiveSession()
        if current_spark:
            current_spark.stop()

        # Wait a moment for cleanup
        time.sleep(5)

        # Create a fresh Spark session
        new_spark = SparkSession.builder \
            .appName("ArtifactLogging") \
            .config("spark.sql.adaptive.enabled", "true") \
            .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
            .getOrCreate()
        return new_spark
    except Exception as e:
        print(f"Error resetting Spark context: {e}")
        return None

# Your training workflow
with mlflow.start_run() as run:
    # Training loop with metric logging
    for epoch in range(350):
        # ... training code ...
        mlflow.log_metrics({"loss": loss, "accuracy": acc}, step=epoch)

    print("Training completed, preparing to log artifacts...")

    # Reset Spark context before artifact logging
    spark = reset_spark_context()

    # Now log artifacts
    print("Logging artifacts...")
    mlflow.log_artifacts("path/to/artifacts", "artifacts")
Solution 2: Chunked Artifact Logging with Delays
import time
import os
import mlflow

def log_artifacts_in_chunks(artifact_dir, chunk_size=5, delay_seconds=2):
    """Log artifacts in small chunks with delays to prevent resource exhaustion."""
    artifact_files = []
    for root, dirs, files in os.walk(artifact_dir):
        for file in files:
            artifact_files.append(os.path.join(root, file))

    print(f"Total artifacts to log: {len(artifact_files)}")

    # Process in chunks
    for i in range(0, len(artifact_files), chunk_size):
        chunk = artifact_files[i:i + chunk_size]
        print(f"Logging chunk {i // chunk_size + 1}/{(len(artifact_files) - 1) // chunk_size + 1}")

        for artifact_path in chunk:
            try:
                mlflow.log_artifact(artifact_path)
                print(f"Logged: {os.path.basename(artifact_path)}")
            except Exception as e:
                print(f"Failed to log {artifact_path}: {e}")

        # Delay between chunks to let resources recover
        time.sleep(delay_seconds)

# Usage after training
with mlflow.start_run() as run:
    # ... training code ...
    print("Training completed, logging artifacts in chunks...")
    log_artifacts_in_chunks("path/to/artifacts", chunk_size=3, delay_seconds=3)
This approach should resolve the file descriptor and Py4J gateway issues you're experiencing while ensuring all your artifacts are properly logged for downstream analysis.