
Replacing a Monolithic MLflow Serving Pipeline with Composed Models in Databricks

lschneid
New Contributor II

Hi everyone,

I’m a senior MLE and recently joined a company where all data science and ML workloads run on Databricks. My background is mostly MLOps on Kubernetes, so I’m currently ramping up on Databricks and trying to improve the architecture of some real-time serving models.

To be transparent, it looks like previous teams did not really leverage MLflow as a proper model registry and deployment abstraction. What they have today is essentially a full data pipeline registered as a single MLflow model and deployed via Mosaic AI Model Serving.

The current serving flow looks roughly like this:

Request
  → preprocess A → Model A → output A
  → preprocess B → Model B → output B
  → post-process
Response

Some context and constraints:

  • Model A and Model B are also registered independently in MLflow, so I assume the serving model dynamically loads them from the registry at runtime.

  • The request payload is an S3 URL, so the serving endpoint itself pulls raw data from S3.

  • This setup makes monitoring, debugging, and ownership really painful.

  • In the short term, I cannot introduce Kubernetes or Databricks streaming pipelines; I need to stick with Databricks real-time serving for now.

In my previous roles, I would have used something like BentoML model composition, where each model is served independently and composed behind an orchestration layer. https://docs.bentoml.com/en/latest/get-started/model-composition.html

Given the constraints above, I’m considering something closer to that pattern in Databricks:

  • Serve Model A and Model B as independent MLflow models and Model Serving endpoints.

  • Create a lightweight orchestration model or service that calls those endpoints in sequence.

    • Not sure if Databricks supports internal endpoint resolution or if everything would have to go through public endpoints.

  • Move heavy preprocessing and S3 data loading out of the serving layer, potentially using Databricks Feature Store.

I’d love to hear from people who have dealt with similar setups. Thanks a lot for any guidance.

1 REPLY

SteveOstrowski
Databricks Employee

Hi @lschneid,

This is a common architectural evolution for ML serving on Databricks, and the platform gives you several good options for decomposing a monolithic serving pipeline into cleaner, more maintainable components. Here is a breakdown of the approach I would recommend.


OPTION 1: ORCHESTRATOR PYFUNC MODEL (RECOMMENDED FOR YOUR CASE)

You can create a lightweight custom PyFunc model that acts as the orchestrator. This model calls Model A and Model B endpoints internally. The key mechanism is that Mosaic AI Model Serving endpoints can call other serving endpoints using standard HTTP requests from within the predict() method.

Here is the general pattern:

import mlflow
import mlflow.pyfunc
import requests
import os
import json

class OrchestratorModel(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        self.db_host = os.environ.get("DATABRICKS_HOST")
        self.db_token = os.environ.get("DATABRICKS_TOKEN")
        self.model_a_endpoint = os.environ.get("MODEL_A_ENDPOINT", "model-a-endpoint")
        self.model_b_endpoint = os.environ.get("MODEL_B_ENDPOINT", "model-b-endpoint")

    def _call_endpoint(self, endpoint_name, payload):
        url = f"{self.db_host}/serving-endpoints/{endpoint_name}/invocations"
        headers = {
            "Authorization": f"Bearer {self.db_token}",
            "Content-Type": "application/json"
        }
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()

    def predict(self, context, model_input):
        # Step 1: Preprocess for Model A
        preprocessed_a = self._preprocess_a(model_input)

        # Step 2: Call Model A endpoint
        output_a = self._call_endpoint(
            self.model_a_endpoint,
            {"dataframe_split": preprocessed_a}
        )

        # Step 3: Preprocess for Model B using output from A
        preprocessed_b = self._preprocess_b(output_a)

        # Step 4: Call Model B endpoint
        output_b = self._call_endpoint(
            self.model_b_endpoint,
            {"dataframe_split": preprocessed_b}
        )

        # Step 5: Post-process and return
        return self._postprocess(output_b)

    def _preprocess_a(self, model_input):
        # your preprocessing logic for Model A
        pass

    def _preprocess_b(self, output_a):
        # your preprocessing logic for Model B
        pass

    def _postprocess(self, output_b):
        # your post-processing logic
        pass

When you create the orchestrator serving endpoint, you pass in the environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, MODEL_A_ENDPOINT, MODEL_B_ENDPOINT) through the endpoint configuration. You can set environment variables in the Serving UI under "Advanced configuration" or via the REST API when creating the endpoint. For DATABRICKS_TOKEN, you can reference a Databricks secret using the {{secrets/scope/key}} syntax so you do not hardcode credentials.
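As a concrete illustration, the endpoint configuration body you would POST to /api/2.0/serving-endpoints might look like the following. The model path, endpoint names, secret scope, and workspace URL are all placeholders to adapt to your environment:

```python
import json

# Hypothetical names throughout: adjust the Unity Catalog model path,
# secret scope/key, endpoint names, and workspace URL for your setup.
endpoint_config = {
    "name": "orchestrator",
    "config": {
        "served_entities": [
            {
                "entity_name": "main.ml.orchestrator_model",  # UC-registered orchestrator
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
                "environment_vars": {
                    # The secret reference is resolved by Model Serving at startup,
                    # so no credential is hardcoded in the config.
                    "DATABRICKS_TOKEN": "{{secrets/ml-serving/api-token}}",
                    "DATABRICKS_HOST": "https://my-workspace.cloud.databricks.com",
                    "MODEL_A_ENDPOINT": "model-a-endpoint",
                    "MODEL_B_ENDPOINT": "model-b-endpoint",
                },
            }
        ]
    },
}

print(json.dumps(endpoint_config, indent=2))
```

You would send this body in an authenticated POST to the serving-endpoints REST API (or enter the same environment variables in the Serving UI).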

This approach gives you:
- Model A and Model B deployed as independent, versioned serving endpoints
- An orchestrator endpoint that chains them together
- Independent scaling, monitoring, and ownership for each endpoint
- The ability to update Model A or Model B independently without redeploying the orchestrator


OPTION 2: SINGLE PYFUNC WITH EMBEDDED MODELS

If the inter-endpoint latency from Option 1 is a concern (each internal call adds network round-trip time), you can keep a single endpoint but load Model A and Model B as MLflow artifacts within the orchestrator PyFunc:

import mlflow
import mlflow.pyfunc

class ComposedModel(mlflow.pyfunc.PythonModel):

    def load_context(self, context):
        self.model_a = mlflow.pyfunc.load_model(context.artifacts["model_a"])
        self.model_b = mlflow.pyfunc.load_model(context.artifacts["model_b"])

    def predict(self, context, model_input):
        output_a = self.model_a.predict(self._preprocess_a(model_input))
        output_b = self.model_b.predict(self._preprocess_b(output_a))
        return self._postprocess(output_b)

    def _preprocess_a(self, model_input):
        # your preprocessing logic for Model A
        pass

    def _preprocess_b(self, output_a):
        # your preprocessing logic for Model B
        pass

    def _postprocess(self, output_b):
        # your post-processing logic
        pass

# When logging the orchestrator model:
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        "composed_model",
        python_model=ComposedModel(),
        artifacts={
            "model_a": "models:/model_a_name/Production",
            "model_b": "models:/model_b_name/Production"
        }
    )

This removes the network overhead but couples the models into one deployment artifact. It is a good middle ground if latency is critical but you still want cleaner code structure than today.
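Stripped of the MLflow plumbing, the Option 2 composition is just sequential predict calls. Here is a self-contained sketch with stub models so you can see (and unit-test) the control flow in isolation; all class names and transforms are illustrative, not part of any Databricks API:

```python
class StubModel:
    """Stand-in for an MLflow-loaded model; simply doubles each input."""
    def predict(self, xs):
        return [2 * x for x in xs]

class ComposedPipeline:
    """Chains preprocess -> Model A -> preprocess -> Model B -> postprocess."""

    def __init__(self, model_a, model_b):
        self.model_a = model_a
        self.model_b = model_b

    def predict(self, model_input):
        output_a = self.model_a.predict(self._preprocess_a(model_input))
        output_b = self.model_b.predict(self._preprocess_b(output_a))
        return self._postprocess(output_b)

    def _preprocess_a(self, xs):
        return [x + 1 for x in xs]   # placeholder preprocessing for Model A

    def _preprocess_b(self, xs):
        return xs                    # Model B consumes A's output directly here

    def _postprocess(self, xs):
        return [round(x, 2) for x in xs]

pipeline = ComposedPipeline(StubModel(), StubModel())
print(pipeline.predict([1, 2, 3]))  # [8, 12, 16]
```

Testing the chain with stubs like this is also a cheap way to catch wiring bugs before you log the real composed model.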


MOVING PREPROCESSING AND S3 DATA LOADING OUT OF SERVING

This is a strong instinct. For the S3 data loading in particular, consider these options:

1. Databricks Online Tables / Feature Serving: If the S3 data represents features that can be precomputed, publish them to a Feature Engineering in Unity Catalog table backed by an online table. Your serving endpoint can then perform a feature lookup at inference time instead of pulling raw files from S3. This is the most "Databricks-native" approach.
https://docs.databricks.com/en/machine-learning/feature-store/index.html

2. Precompute and cache: If the data from S3 is relatively static, load it during load_context() rather than on every predict() call. The load_context() method runs once when the endpoint starts up, so it is the right place for heavy initialization.
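A minimal sketch of that caching pattern, using a plain class that mirrors the load_context()/predict() interface of mlflow.pyfunc.PythonModel (the bucket, key, and lookup schema are placeholders, and boto3 is imported lazily so the class can be unit-tested without AWS dependencies):

```python
import json
import os

class CachedDataModel:
    """PyFunc-style model that reads static S3 reference data once at startup."""

    def load_context(self, context):
        # Runs once when the serving replica starts, not on every request,
        # so this is the right place for heavy I/O such as the S3 download.
        import boto3  # deferred import: only needed inside the endpoint
        obj = boto3.client("s3").get_object(
            Bucket=os.environ.get("REFERENCE_BUCKET", "my-bucket"),        # placeholder
            Key=os.environ.get("REFERENCE_KEY", "reference/lookup.json"),  # placeholder
        )
        self.lookup = json.loads(obj["Body"].read())

    def predict(self, context, model_input):
        # Enrich each request record from the cached lookup
        # instead of re-reading S3 on every call.
        return [
            {**record, "feature": self.lookup.get(record["id"])}
            for record in model_input
        ]
```

The same structure works as a method on your orchestrator PyFunc if you do not want a separate class.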

3. Feature Serving endpoint: Databricks also offers standalone Feature Serving endpoints that serve precomputed features with low latency. Your orchestrator model can call a Feature Serving endpoint just like it calls Model A or Model B.


INTERNAL ENDPOINT RESOLUTION

To answer your specific question about internal vs. public endpoints: within Mosaic AI Model Serving, your custom model code makes HTTP calls to the Databricks REST API (the /serving-endpoints/{name}/invocations path). These calls go through the Databricks control plane, not over the public internet, as long as you are calling endpoints within the same workspace. You authenticate with a Databricks token (stored as a secret), and the request is routed internally.


RECOMMENDED ARCHITECTURE

Based on your constraints, here is the architecture I would suggest:

1. Deploy Model A as its own serving endpoint (registered in MLflow / Unity Catalog)
2. Deploy Model B as its own serving endpoint
3. Move S3 data loading into either:
- A precompute job that writes to an online table (feature store)
- The load_context() of the orchestrator if the data is static
4. Deploy an Orchestrator PyFunc model that:
- Accepts the request
- Looks up features from the online table (or receives them in the request)
- Calls Model A endpoint
- Transforms output and calls Model B endpoint
- Returns the final response
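End to end, a client call into the orchestrator would then keep your current contract of passing an S3 URL. A small sketch of the request shape, using only the standard library; the endpoint name and the single-column payload schema are assumptions based on your description:

```python
import json
import urllib.request

def build_invocation(host, endpoint_name, s3_url):
    """Builds the invocation URL and dataframe_split payload for Model Serving."""
    url = f"{host}/serving-endpoints/{endpoint_name}/invocations"
    payload = {"dataframe_split": {"columns": ["s3_url"], "data": [[s3_url]]}}
    return url, payload

def invoke(host, token, endpoint_name, s3_url):
    """Sends the request with bearer-token auth and returns the parsed JSON."""
    url, payload = build_invocation(host, endpoint_name, s3_url)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Keeping build_invocation() separate from the network call makes the payload construction trivially testable.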


USEFUL DOCUMENTATION LINKS

- Mosaic AI Model Serving overview: https://docs.databricks.com/en/machine-learning/model-serving/index.html
- Custom model serving: https://docs.databricks.com/en/machine-learning/model-serving/custom-models.html
- Creating and managing serving endpoints: https://docs.databricks.com/en/machine-learning/model-serving/create-manage-serving-endpoints.html
- Feature Engineering in Unity Catalog: https://docs.databricks.com/en/machine-learning/feature-store/index.html
- MLflow PythonModel API: https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html

* This reply used an agent system I built to research and draft the response, based on the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect any drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.