Machine Learning

Nested runs don't group correctly in MLflow

dkxxx-rc
New Contributor II

How do I get MLflow child runs to appear as children of their parent run in the MLflow GUI, if I'm choosing my own experiment location instead of letting everything be written to the default experiment location?

If I run the standard tutorial (https://docs.databricks.com/_extras/notebooks/source/mlflow/mlflow-end-to-end-example-uc.html) of running parameter tuning on an XGBoost model, with logging to MLflow, the individual runs are grouped together nicely in the MLflow UI under the default experiment location:

[Screenshot: MLflow UI showing the runs grouped together]

But there's trouble with the nesting if I take control of the name and location of the MLflow experiment.  Say I set up an experiment location as follows:

 

EXPERIMENT_NAME = '/Users/dxxxx@realchemistry.com/MLflow_experiments/dxxxx_minimal_MLflow'

# Get the experiment ID if it exists, or create a new one
experiment_id = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

if experiment_id is None:
    # If the experiment does not exist, create it
    experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
else:
    # If the experiment exists, get its ID
    experiment_id = experiment_id.experiment_id

 

If I do a single model training run, using 

 

with mlflow.start_run(experiment_id=experiment_id, run_name='untuned_random_forest'):

 

the model is archived with run name untuned_random_forest to a new experiment page dxxxx_minimal_MLflow exactly as I intend.

However, trouble turns up when I try a parameter optimization job with the runs to be nested.  I set the experiment_id using 

# Run fmin within an MLflow run context so that each hyperparameter configuration is logged as a child run of a parent
# run called "xgboost_models_2".
with mlflow.start_run(experiment_id=experiment_id, run_name='xgboost_models_2') as parent_run:
  run_id_value = parent_run.info.run_id
  search_space['parent_run_id'] = run_id_value
  best_params = fmin(
    fn=train_model, 
    space=search_space, 
    algo=tpe.suggest, 
    max_evals=8,
    trials=spark_trials,
  )

which invokes the defined function train_model():

def train_model(params):
  mlflow.xgboost.autolog()
  with mlflow.start_run(nested=True):
    train = xgb.DMatrix(data=X_train, label=y_train)
    validation = xgb.DMatrix(data=X_val, label=y_val)
    # ... (remaining training code omitted in the original post)

the nesting (note nested=True) doesn't work, or at least doesn't appear to work.  The bizarre outcome is that my experiment page gets a new run called xgboost_models_2, but it doesn't have any children.  All the child runs are visible, but not on my experiment page: they're only visible on the default experiment page, with no indication that they're children of anything.  If you look inside the child runs, they each have a parent_run_id that seems right, but the GUI can't seem to figure out that it should group them under the parent run on my personal experiment page.

1 ACCEPTED SOLUTION

Accepted Solutions

dkxxx-rc
New Contributor II

OK, here's more info about what's wrong, and a solution.

I used additional parameter logging to determine that no matter how I adjust the parameters of the inner call to 
```
mlflow.start_run()
```

the `experiment_id` of the child runs differs from that of the parent run.  The inner call ignores `nested=True`, ignores any `experiment_id` value passed to it, and assigns the child runs to a separate experiment named after the notebook.  Because parent and children end up with conflicting `experiment_id` values, they don't group together in the GUI.

That's pretty annoying.

However, the whole problem goes away if I set an `experiment_id` value in a global sense, back at the beginning.  Specifically, in the block that sets and uses EXPERIMENT_NAME, add one more line of code at the end:
```
mlflow.set_experiment(experiment_id=experiment_id)
```
and then everything works exactly as it should.  The child runs show up as nested under the parent run in my personal Experiment space.
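
For reference, the fix described above can be folded into one small helper. This is only a sketch, assuming MLflow is installed and a tracking server is reachable; the function name `activate_experiment` is made up for illustration and is not part of MLflow. It uses only `mlflow.get_experiment_by_name`, `mlflow.create_experiment`, and `mlflow.set_experiment`, the same calls that appear in this thread.

```python
def activate_experiment(name: str) -> str:
    """Get or create the experiment, make it the active one, and return its ID.

    Hypothetical helper name; the three mlflow calls are the ones from the thread.
    """
    import mlflow  # imported here so the module loads only when the helper runs

    experiment = mlflow.get_experiment_by_name(name)
    if experiment is None:
        # The experiment does not exist yet: create it and keep the new ID.
        experiment_id = mlflow.create_experiment(name)
    else:
        experiment_id = experiment.experiment_id

    # The extra line from the fix: set the experiment globally, so runs started
    # without an explicit experiment_id also land in this experiment.
    mlflow.set_experiment(experiment_id=experiment_id)
    return experiment_id
```

A notebook would call `experiment_id = activate_experiment(EXPERIMENT_NAME)` once, before the parent `mlflow.start_run(...)` block. The thread does not explain why the child runs otherwise end up in the notebook's own experiment, so treat this as a workaround that was reported to work rather than a documented guarantee.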


5 REPLIES

Walter_C
Databricks Employee

To ensure that MLflow child runs appear as children of their parent run in the MLflow GUI when using a custom experiment location, follow these steps:

  1. Set Up the Experiment Location:

    EXPERIMENT_NAME = '/Users/dxxxx@realchemistry.com/MLflow_experiments/dxxxx_minimal_MLflow'
    
    # Get the experiment ID if it exists, or create a new one
    experiment_id = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    
    if experiment_id is None:
        # If the experiment does not exist, create it
        experiment_id = mlflow.create_experiment(EXPERIMENT_NAME)
    else:
        # If the experiment exists, get its ID
        experiment_id = experiment_id.experiment_id
  2. Start the Parent Run:

    with mlflow.start_run(experiment_id=experiment_id, run_name='xgboost_models_2') as parent_run:
        run_id_value = parent_run.info.run_id
        search_space['parent_run_id'] = run_id_value
        best_params = fmin(
            fn=train_model, 
            space=search_space, 
            algo=tpe.suggest, 
            max_evals=8,
            trials=spark_trials,
        )
  3. Define the Training Function with Nested Runs:

    def train_model(params):
        mlflow.xgboost.autolog()
        with mlflow.start_run(nested=True):
            train = xgb.DMatrix(data=X_train, label=y_train)
            validation = xgb.DMatrix(data=X_val, label=y_val)
            # Additional training code here
  4. Ensure Correct Parent-Child Relationship:

    • Verify that the parent_run_id is correctly set in the search_space.
    • Ensure that the nested=True parameter is used in the mlflow.start_run call within the train_model function.

dkxxx-rc
New Contributor II

Hi, thanks for your response.  It doesn't seem to help at all, however.  The solution you suggest is what I've already done (including once more just now, to make sure), and it achieves the same outcome I've already described: 

  • the parent run appears on my own experiment page with no children
  • the child runs appear on the default experiment page with no parents

Let me try to provide a little more detail in case it's helpful. 

  • My latest parent run has Run ID = `5e0500d99c9d41069138d9e10fe7e83e`
  • Looking into one of the child runs, it has its own Run ID value and it has a field "Parent run" which points to the same parent run -- the value is a hyperlink to https://[redacted].cloud.databricks.com/ml/experiments/4161759641583557/runs/5e0500d99c9d41069138d9e... which points to that same parent Run ID.
  • And yet, the child runs still show up in the GUI only on the default Experiment page, not grouped with the Parent run (which is still living by itself on my Experiment page with no children).

It looks somewhat like the `nested=True` parameter is doing a good job of getting the parent run ID assigned to the child run, but the GUI isn't honoring the parent-child relationship when it decides where to display the parent and child runs.

FOOTNOTE:  You mention setting `parent_run_id` without saying what to use it for.  Do you think there's a useful way to use it?  I created it only as part of a later experiment, to try passing it as an optional argument to the inner `mlflow.start_run()` call, but it didn't seem to have any effect on the outcome.
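
One way to confirm the symptom described in this reply (a child run that carries the right parent run ID but sits in a different experiment) is to compare experiment IDs directly. The sketch below works on plain dictionaries so it can be tried without a tracking server; the record layout and the function name `find_misplaced_children` are assumptions made for illustration. In practice each record could be filled in from `MlflowClient().get_run(run_id)`, using `run.info.run_id`, `run.info.experiment_id`, and `run.data.tags`. The tag key `mlflow.parentRunId` is the one MLflow uses to link a nested run to its parent.

```python
from typing import Any, Mapping, Sequence

PARENT_TAG = "mlflow.parentRunId"  # tag MLflow sets on nested (child) runs


def find_misplaced_children(
    runs: Sequence[Mapping[str, Any]],
) -> list[tuple[str, str, str]]:
    """Return (child_run_id, child_experiment_id, parent_experiment_id) for
    every child run logged to a different experiment than its parent."""
    experiment_of = {r["run_id"]: r["experiment_id"] for r in runs}
    misplaced = []
    for r in runs:
        parent_id = r.get("tags", {}).get(PARENT_TAG)
        if parent_id is None or parent_id not in experiment_of:
            continue  # not a child run, or its parent is not in the list
        parent_experiment = experiment_of[parent_id]
        if r["experiment_id"] != parent_experiment:
            misplaced.append((r["run_id"], r["experiment_id"], parent_experiment))
    return misplaced


# Made-up records for illustration only; these are not IDs from the thread.
example_runs = [
    {"run_id": "parent-1", "experiment_id": "exp-custom", "tags": {}},
    {"run_id": "child-1", "experiment_id": "exp-notebook",
     "tags": {PARENT_TAG: "parent-1"}},
    {"run_id": "child-2", "experiment_id": "exp-custom",
     "tags": {PARENT_TAG: "parent-1"}},
]
misplaced_children = find_misplaced_children(example_runs)
```

Any entry in `misplaced_children` is a run whose parent link is intact but whose experiment differs from the parent's, which matches the situation described here.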

Walter_C
Databricks Employee

When creating child runs, explicitly set the parent run ID:

def train_model(params):
    mlflow.xgboost.autolog()
    with mlflow.start_run(nested=True, run_name="child_run", parent_run_id=parent_run.info.run_id):
        ...  # Your existing code here

 

dkxxx-rc
New Contributor II

This has no new effect.  Still unsuccessful at grouping the child runs under the parent. 

(Which seems pretty reasonable, honestly, since as noted above, the Parent Run ID is already correctly tagged on the child runs.)

