lingareddy_Alva
Esteemed Contributor

Hi @anirbanmishra 

This is a common issue with MLflow on Databricks, particularly when dealing with large experiments or numerous artifacts.
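The "filedescriptor out of range in select()" error typically points to resource exhaustion or connection pool issues with
the Py4J gateway that bridges Python and Spark/JVM. Python raises it when select() is handed a file descriptor numbered at or
above FD_SETSIZE (commonly 1024), which tends to happen once the process has accumulated many open files or sockets.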

The most effective immediate solution is usually to reduce the frequency of artifact logging and increase the file descriptor limits.
If the issue persists, try separating the training and logging phases entirely.
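
Before changing anything, it can help to confirm that descriptors are actually piling up in the driver process. The sketch below uses only the Python standard library; the function names are my own, and it assumes a Linux driver where /proc is available. Treat it as a diagnostic aid, not a guaranteed fix: select() only handles descriptors below FD_SETSIZE, so a steadily growing count is often the more useful signal than the limit itself.

```python
import os
import resource  # Unix-only standard-library module


def open_fd_count():
    """Approximate number of open file descriptors for this process.

    Returns None when /proc is not available (non-Linux systems).
    """
    try:
        return len(os.listdir("/proc/self/fd"))
    except OSError:
        return None


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward `target`, never above the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft  # nothing to raise
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    print("open descriptors:", open_fd_count())
    print("soft/hard limits:", fd_limits())
```

Calling open_fd_count() at the start and end of each epoch shows whether logging is leaking handles. If the number keeps climbing, reducing or batching the logging calls is likely to help more than raising the limit.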

Reduce Artifact Logging Frequency
Instead of logging artifacts at every epoch, log them at intervals:

# Log artifacts every 10 epochs instead of every epoch
if epoch % 10 == 0:
    mlflow.log_artifact(artifact_path)
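
A slightly fuller sketch of the same idea is below. The function name, the checkpoint file names, and the 1-based epoch counter are my own assumptions for illustration; the logging call is passed in as a parameter so the loop can be tried without an MLflow run. It also logs the final epoch, so the last checkpoint is not lost when the epoch count is not a multiple of the interval.

```python
import os
import tempfile


def train_with_interval_logging(num_epochs, log_every, log_artifact):
    """Run a placeholder training loop, logging an artifact every
    `log_every` epochs and once more on the final epoch.

    `log_artifact` is any callable taking a local file path,
    for example mlflow.log_artifact inside an active run.
    Returns the list of epochs at which an artifact was logged.
    """
    logged_epochs = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        for epoch in range(1, num_epochs + 1):
            # ... the real training step for this epoch would go here ...

            is_last = epoch == num_epochs
            if epoch % log_every == 0 or is_last:
                path = os.path.join(tmp_dir, f"checkpoint_epoch_{epoch}.txt")
                # The with-block closes the file handle before logging,
                # so the loop itself does not hold descriptors open.
                with open(path, "w") as fh:
                    fh.write(f"epoch={epoch}\n")
                log_artifact(path)
                logged_epochs.append(epoch)
    return logged_epochs
```

In a notebook this would be called as train_with_interval_logging(100, 10, mlflow.log_artifact) inside a `with mlflow.start_run():` block, which cuts the number of upload calls from 100 to 10.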

 

LR