Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Table-Model Lineage for models without online Feature Lookups

ssequ
New Contributor II

Hi community,

I am looking for the recommended way to achieve table-model lineage in Unity Catalog for models that don't use Feature Lookups but only offline features. 

When I use FeatureEngineeringClient.create_training_set with feature_lookups + mlflow experiment tracking, this works well and the respective feature stores show up in the model lineage. However, I haven't found a way to use offline features only.

Tracking an MLflow model without FeatureEngineeringClient.create_training_set works, but then the lineage doesn't show up in Unity Catalog. Passing an empty list as feature_lookups results in

 WARNING databricks.ml_features._catalog_client._catalog_client_helper: Failed to record consumer in the catalog. Exception: {'error_code': 'NOT_FOUND', 'message': 'Workspace Feature Store has been deprecated in the current workspace. Databricks recommends using Feature Engineering in Unity Catalog.

and the lineage won't show up either. This is particularly weird since there is no such warning when I pass actual FeatureLookups instead of the empty list.

Thanks for any help

#featurestore #mlflow

1 REPLY

Louis_Frolio
Databricks Employee

Hey @ssequ, sorry this fell through the cracks, but I have some ideas for you to consider.

 
You can get Unity Catalog table→model lineage without Feature Lookups by logging the training datasets to MLflow and registering the model in Unity Catalog.
 

Recommended approach (offline features only)

 
Use MLflow dataset logging to record the UC tables you trained/evaluated on, then register the model to Unity Catalog:
  • Ensure you’re on MLflow ≥ 2.11; table→model lineage uses mlflow.log_input and is supported from 2.11 onward.
  • Load your training data from UC tables and create MLflow dataset objects (for example with mlflow.data.load_delta) so lineage can resolve to UC assets.
  • Call mlflow.log_input(dataset, context="training") for each upstream table or for a snapshot table you create for training, then log and register your model to UC; lineage will appear on the model version’s Lineage tab in Catalog Explorer.
  • Include a model signature (either provide it or let MLflow infer it via input_example) because UC requires model versions to have signatures when registering.

Minimal example

 
```python
import mlflow
from sklearn.ensemble import RandomForestClassifier

# 1) Load UC table(s) used for training and create MLflow dataset(s)
dataset = mlflow.data.load_delta(
    table_name="prod.ml_team.features_customer_churn", version="42"
)
pdf = dataset.df.toPandas()
X = pdf.drop(columns=["label"])
y = pdf["label"]

with mlflow.start_run():
    # 2) Train
    clf = RandomForestClassifier(max_depth=7, n_estimators=200)
    clf.fit(X, y)

    # 3) Log the training dataset for lineage
    mlflow.log_input(dataset, context="training")

    # 4) Log + register the model in Unity Catalog (three-level name)
    # input_example lets MLflow infer the signature that UC requires
    input_example = X.iloc[[0]]
    mlflow.sklearn.log_model(
        sk_model=clf,
        name="model",  # MLflow 3 parameter; on MLflow 2.x use artifact_path="model"
        input_example=input_example,
        registered_model_name="prod.ml_team.churn_rf",
    )
```
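Before checking the Lineage tab, it can help to confirm that the run actually recorded the dataset. The sketch below reads the logged inputs back from a run object. `summarize_dataset_inputs` is a hypothetical helper (not part of MLflow), and it only shows what MLflow stored on the run, not what Unity Catalog displays.

```python
def summarize_dataset_inputs(run):
    """Return (name, source_type, context) for each dataset logged on an MLflow run."""
    rows = []
    for dataset_input in run.inputs.dataset_inputs:
        # The context passed to mlflow.log_input is stored as an input tag.
        tags = {tag.key: tag.value for tag in dataset_input.tags}
        rows.append(
            (
                dataset_input.dataset.name,
                dataset_input.dataset.source_type,
                tags.get("mlflow.data.context"),
            )
        )
    return rows


# Usage in a notebook (run_id is the ID of the training run):
#   import mlflow
#   run = mlflow.get_run(run_id)
#   print(summarize_dataset_inputs(run))
```

An empty list here would suggest that mlflow.log_input was never called inside the run, which is worth ruling out before looking into the UC side.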
 

Notes on your current behavior

 
  • The warning you saw with an empty feature_lookups list comes from the Workspace Feature Store deprecation path, whereas passing actual FeatureLookups goes through the Feature Engineering in Unity Catalog path, which captures lineage automatically. If you don’t want Feature Lookups, skip FeatureEngineeringClient and use mlflow.log_input to capture lineage from offline UC tables.

Variations and best practices

 
  • If your training data is built from multiple offline tables, log each source:
    • mlflow.log_input(mlflow.data.load_delta(table_name="catalog.schema.tableA", version="..."), "training")
    • mlflow.log_input(mlflow.data.load_delta(table_name="catalog.schema.tableB", version="..."), "training")
  • If you train on an ephemeral DataFrame (not a UC table), persist a snapshot to UC first (for reproducibility and lineage), then load and log that snapshot with a version number.
  • You can also log evaluation datasets:
    • mlflow.log_input(dataset_eval, context="evaluation")
  • Make sure your MLflow client targets UC (MLflow 3 defaults to databricks-uc; otherwise set the registry URI explicitly with mlflow.set_registry_uri("databricks-uc")) and that you use the three‑level registered_model_name when logging.
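
For the snapshot variation above, a minimal sketch might look like the following. It assumes a Databricks notebook where `spark` is the active SparkSession and that the function is called inside an active MLflow run. The helper names `is_three_level_name` and `snapshot_and_log` are hypothetical, not part of MLflow or Databricks, and the name check is deliberately naive (it does not handle backtick-quoted identifiers).

```python
def is_three_level_name(name: str) -> bool:
    """True if name looks like catalog.schema.object (three non-empty parts)."""
    parts = name.split(".")
    return len(parts) == 3 and all(part.strip() for part in parts)


def snapshot_and_log(spark, df, snapshot_table: str, context: str = "training"):
    """Persist df as a UC Delta table, then log that exact version as an MLflow input."""
    if not is_three_level_name(snapshot_table):
        raise ValueError(f"expected catalog.schema.table, got {snapshot_table!r}")

    import mlflow  # imported lazily so the name check above has no dependencies

    # Write the snapshot; each overwrite creates a new Delta table version.
    df.write.format("delta").mode("overwrite").saveAsTable(snapshot_table)

    # DESCRIBE HISTORY lists the newest commit first; take the version just written.
    history = spark.sql(f"DESCRIBE HISTORY {snapshot_table} LIMIT 1")
    version = history.first()["version"]

    # Pin the dataset to that version so the logged input is reproducible.
    dataset = mlflow.data.load_delta(table_name=snapshot_table, version=str(version))
    mlflow.log_input(dataset, context=context)
    return dataset


# Usage inside a run (the table name is a placeholder):
#   with mlflow.start_run():
#       dataset = snapshot_and_log(spark, training_df, "catalog.schema.train_snapshot")
```

Pinning the version keeps the logged input tied to the exact rows used for training even if the snapshot table is overwritten later.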
 
Hope this helps, Louis.