<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: filedescriptor out of range in select() in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120927#M4105</link>
    <description>&lt;P&gt;Thank you for replying. I don't log any artifacts during the epochs; I only log metrics every epoch. I try to log all artifacts at the end of training, which is why I see the experiment finish successfully and then these errors appear. How can I avoid the error in this case? I need most of the artifacts for the next steps and downstream analysis.&lt;/P&gt;</description>
    <pubDate>Wed, 04 Jun 2025 13:17:53 GMT</pubDate>
    <dc:creator>anirbanmishra</dc:creator>
    <dc:date>2025-06-04T13:17:53Z</dc:date>
    <item>
      <title>filedescriptor out of range in select()</title>
      <link>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120677#M4095</link>
      <description>&lt;P&gt;Hi All&lt;/P&gt;&lt;P&gt;I am running a trining job using Mlflow and Databricks recipe. In the recipe.train step the training starts an experiment and runs for 350 epochs. After the 350 epochs are completed and I try to log the artifacts, the process gets stuck for a long time and I keep getting this error multiple times&lt;/P&gt;&lt;P&gt;ValueError: filedescriptor out of range in select()&lt;/P&gt;&lt;P&gt;Sun Jun&amp;nbsp; 1 02:38:07 2025 Connection to spark from PID&amp;nbsp; 4942&lt;/P&gt;&lt;P&gt;Sun Jun&amp;nbsp; 1 02:38:08 2025 Initialized gateway on port 44549&lt;/P&gt;&lt;P&gt;ERROR:py4j.java_gateway:Error while waiting for a connection.&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;/P&gt;&lt;P&gt;&amp;nbsp; File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; readable, writable, errored = select.select(&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ^^^^^^^^^^^^^^&lt;/P&gt;&lt;P&gt;ValueError: filedescriptor out of range in select()&lt;/P&gt;&lt;P&gt;Sun Jun&amp;nbsp; 1 02:38:08 2025 Connection to spark from PID&amp;nbsp; 4942&lt;/P&gt;&lt;P&gt;Sun Jun&amp;nbsp; 1 02:38:08 2025 Initialized gateway on port 35473&lt;/P&gt;&lt;P&gt;ERROR:py4j.java_gateway:Error while waiting for a connection.&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;/P&gt;&lt;P&gt;&amp;nbsp; File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 2316, in run&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; readable, writable, errored = 
select.select(&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ^^^^^^^^^^^^^^&lt;/P&gt;&lt;P&gt;ValueError: filedescriptor out of range in select()&lt;BR /&gt;&lt;BR /&gt;During this time the CPU usage reaches almost 100% and after an hour or so the recipie.train() step fails with&lt;BR /&gt;&lt;SPAN class=""&gt;Fatal error: &lt;/SPAN&gt;&lt;SPAN&gt;The Python kernel is unresponsive. While through out the training step the CPU and GPU usage are below 40% mostly.&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;I am also using the databricks recipe to log the regular artifacts as part of the experiment.&lt;BR /&gt;&lt;BR /&gt;Has anyone faced the above issue. Please let me know if any log would help in identifying the real problem.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2025 02:52:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120677#M4095</guid>
      <dc:creator>anirbanmishra</dc:creator>
      <dc:date>2025-06-02T02:52:51Z</dc:date>
    </item>
    <item>
      <title>Re: filedescriptor out of range in select()</title>
      <link>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120866#M4100</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/166778"&gt;@anirbanmishra&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is a common issue with MLflow on Databricks, particularly when dealing with large experiments or numerous artifacts.&lt;BR /&gt;The "filedescriptor out of range in select()" error typically points to resource exhaustion or connection pool issues with&lt;BR /&gt;the Py4J gateway that bridges Python and Spark/JVM. Python's select() raises it when it is handed a file descriptor number at or above FD_SETSIZE (usually 1024 on Linux), which generally means the process has accumulated many open files or sockets.&lt;/P&gt;&lt;P&gt;The most effective immediate solution is usually to reduce the frequency of artifact logging and increase the file descriptor limits.&lt;BR /&gt;If the issue persists, try separating the training and logging phases entirely.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reduce Artifact Logging Frequency&lt;/STRONG&gt;&lt;BR /&gt;Instead of logging artifacts at every epoch, log them at intervals:&lt;/P&gt;&lt;P&gt;# Log artifacts every 10 epochs instead of every epoch&lt;BR /&gt;if epoch % 10 == 0:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mlflow.log_artifact(artifact_path)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2025 01:13:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120866#M4100</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-06-04T01:13:22Z</dc:date>
    </item>
    <item>
      <title>Re: filedescriptor out of range in select()</title>
      <link>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120927#M4105</link>
      <description>&lt;P&gt;Thank you for replying. I don't log any artifacts during the epochs; I only log metrics every epoch. I try to log all artifacts at the end of training, which is why I see the experiment finish successfully and then these errors appear. How can I avoid the error in this case? I need most of the artifacts for the next steps and downstream analysis.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2025 13:17:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120927#M4105</guid>
      <dc:creator>anirbanmishra</dc:creator>
      <dc:date>2025-06-04T13:17:53Z</dc:date>
    </item>
    <item>
      <title>Re: filedescriptor out of range in select()</title>
      <link>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120952#M4107</link>
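      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/166778"&gt;@anirbanmishra&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I see: you're logging metrics during training (which works fine) but hitting the file descriptor/Py4J gateway errors specifically when logging artifacts at the end, after 350 epochs. This is a more complex issue because by that point the long-running training process has likely exhausted some resources. Here are targeted solutions for your specific scenario:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Solution 1: Reset Spark Context Before Artifact Logging&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;import time&lt;BR /&gt;import mlflow&lt;BR /&gt;from pyspark.sql import SparkSession&lt;/P&gt;&lt;P&gt;# After training completes but before artifact logging&lt;BR /&gt;def reset_spark_context():&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;try:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Stop the active Spark session, if any.&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Caution: in a Databricks notebook the session is shared, so stopping it may affect other code.&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;current_spark = SparkSession.getActiveSession()&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;if current_spark:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;current_spark.stop()&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Wait a moment for cleanup&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;time.sleep(5)&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Create a fresh Spark session&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;new_spark = SparkSession.builder \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.appName("ArtifactLogging") \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.config("spark.sql.adaptive.enabled", "true") \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.config("spark.sql.adaptive.coalescePartitions.enabled", "true") \&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.getOrCreate()&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return new_spark&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;except Exception as e:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(f"Error resetting Spark context: {e}")&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return None&lt;/P&gt;&lt;P&gt;# Your training workflow&lt;BR /&gt;with mlflow.start_run() as run:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Training loop with metric logging&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for epoch in range(350):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# ... training code that computes loss and acc ...&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mlflow.log_metrics({"loss": loss, "accuracy": acc}, step=epoch)&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print("Training completed, preparing to log artifacts...")&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Reset Spark context before artifact logging&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;spark = reset_spark_context()&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Now log artifacts&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print("Logging artifacts...")&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mlflow.log_artifacts("path/to/artifacts", "artifacts")&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Solution 2: Chunked Artifact Logging with Delays&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;import os&lt;BR /&gt;import time&lt;BR /&gt;import mlflow&lt;/P&gt;&lt;P&gt;def log_artifacts_in_chunks(artifact_dir, chunk_size=5, delay_seconds=2):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;"""Log artifacts in small chunks with delays to prevent resource exhaustion"""&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Collect every file under artifact_dir, including subdirectories&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;artifact_files = []&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for root, _, files in os.walk(artifact_dir):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for file in files:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;artifact_files.append(os.path.join(root, file))&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(f"Total artifacts to log: {len(artifact_files)}")&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;total_chunks = (len(artifact_files) + chunk_size - 1) // chunk_size&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Process in chunks&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for i in range(0, len(artifact_files), chunk_size):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;chunk = artifact_files[i:i+chunk_size]&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(f"Logging chunk {i//chunk_size + 1}/{total_chunks}")&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;for artifact_path in chunk:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;try:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Keep each file's subdirectory so same-named files do not collide&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;rel_dir = os.path.relpath(os.path.dirname(artifact_path), artifact_dir)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;mlflow.log_artifact(artifact_path, None if rel_dir == "." else rel_dir)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(f"Logged: {os.path.basename(artifact_path)}")&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;except Exception as e:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print(f"Failed to log {artifact_path}: {e}")&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# Delay between chunks to let resources recover&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;time.sleep(delay_seconds)&lt;/P&gt;&lt;P&gt;# Usage after training&lt;BR /&gt;with mlflow.start_run() as run:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;# ... training code ...&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;print("Training completed, logging artifacts in chunks...")&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;log_artifacts_in_chunks("path/to/artifacts", chunk_size=3, delay_seconds=3)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Either approach should help with the file descriptor and Py4J gateway errors you're experiencing while still logging all of your artifacts for downstream analysis.&lt;/P&gt;</description>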
      <pubDate>Wed, 04 Jun 2025 16:29:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/filedescriptor-out-of-range-in-select/m-p/120952#M4107</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2025-06-04T16:29:11Z</dc:date>
    </item>
  </channel>
</rss>

