Hi @lschneid,
This is a common architectural evolution for ML serving on Databricks, and the platform gives you several good options for decomposing a monolithic serving pipeline into cleaner, more maintainable components. Here is a breakdown of the approach I would recommend.
OPTION 1: ORCHESTRATOR PYFUNC MODEL (RECOMMENDED FOR YOUR CASE)
You can create a lightweight custom PyFunc model that acts as the orchestrator. This model calls Model A and Model B endpoints internally. The key mechanism is that Mosaic AI Model Serving endpoints can call other serving endpoints using standard HTTP requests from within the predict() method.
Here is the general pattern:
import mlflow
import mlflow.pyfunc
import requests
import os


class OrchestratorModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.db_host = os.environ.get("DATABRICKS_HOST")
        self.db_token = os.environ.get("DATABRICKS_TOKEN")
        self.model_a_endpoint = os.environ.get("MODEL_A_ENDPOINT", "model-a-endpoint")
        self.model_b_endpoint = os.environ.get("MODEL_B_ENDPOINT", "model-b-endpoint")

    def _call_endpoint(self, endpoint_name, payload):
        url = f"{self.db_host}/serving-endpoints/{endpoint_name}/invocations"
        headers = {
            "Authorization": f"Bearer {self.db_token}",
            "Content-Type": "application/json"
        }
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()

    def predict(self, context, model_input):
        # Step 1: Preprocess for Model A
        preprocessed_a = self._preprocess_a(model_input)

        # Step 2: Call Model A endpoint
        output_a = self._call_endpoint(
            self.model_a_endpoint,
            {"dataframe_split": preprocessed_a}
        )

        # Step 3: Preprocess for Model B using output from A
        preprocessed_b = self._preprocess_b(output_a)

        # Step 4: Call Model B endpoint
        output_b = self._call_endpoint(
            self.model_b_endpoint,
            {"dataframe_split": preprocessed_b}
        )

        # Step 5: Post-process and return
        return self._postprocess(output_b)

    def _preprocess_a(self, model_input):
        # your preprocessing logic for Model A
        pass

    def _preprocess_b(self, output_a):
        # your preprocessing logic for Model B
        pass

    def _postprocess(self, output_b):
        # your post-processing logic
        pass
When you create the orchestrator serving endpoint, you pass in the environment variables (DATABRICKS_HOST, DATABRICKS_TOKEN, MODEL_A_ENDPOINT, MODEL_B_ENDPOINT) through the endpoint configuration. You can set environment variables in the Serving UI under "Advanced configuration" or via the REST API when creating the endpoint. For DATABRICKS_TOKEN, you can reference a Databricks secret using the {{secrets/scope/key}} syntax so you do not hardcode credentials.
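As a sketch, the endpoint configuration could look like the following (the endpoint name, entity name, secret scope, and workspace URL are all placeholders you would substitute; the payload shape follows the serving-endpoints REST API):

```python
import requests

# Hypothetical names throughout -- substitute your own workspace URL,
# registered model, and secret scope/key.
ORCHESTRATOR_CONFIG = {
    "name": "orchestrator-endpoint",
    "config": {
        "served_entities": [
            {
                "entity_name": "orchestrator_model",
                "entity_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
                "environment_vars": {
                    "DATABRICKS_HOST": "https://<workspace-url>",
                    # Secret reference syntax -- resolved by the platform at
                    # container startup, so no token is stored in plain text.
                    "DATABRICKS_TOKEN": "{{secrets/my-scope/serving-token}}",
                    "MODEL_A_ENDPOINT": "model-a-endpoint",
                    "MODEL_B_ENDPOINT": "model-b-endpoint",
                },
            }
        ]
    },
}


def create_endpoint(host, token, config=ORCHESTRATOR_CONFIG):
    # One-time creation call; updating an existing endpoint's config uses
    # PUT /api/2.0/serving-endpoints/{name}/config instead.
    resp = requests.post(
        f"{host}/api/2.0/serving-endpoints",
        headers={"Authorization": f"Bearer {token}"},
        json=config,
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```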
This approach gives you:
- Model A and Model B deployed as independent, versioned serving endpoints
- An orchestrator endpoint that chains them together
- Independent scaling, monitoring, and ownership for each endpoint
- The ability to update Model A or Model B independently without redeploying the orchestrator
OPTION 2: SINGLE PYFUNC WITH EMBEDDED MODELS
If the inter-endpoint latency from Option 1 is a concern (each internal call adds network round-trip time), you can keep a single endpoint but load Model A and Model B as MLflow artifacts within the orchestrator PyFunc:
import mlflow
import mlflow.pyfunc


class ComposedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.model_a = mlflow.pyfunc.load_model(context.artifacts["model_a"])
        self.model_b = mlflow.pyfunc.load_model(context.artifacts["model_b"])

    def predict(self, context, model_input):
        output_a = self.model_a.predict(self._preprocess_a(model_input))
        output_b = self.model_b.predict(self._preprocess_b(output_a))
        return self._postprocess(output_b)

    # _preprocess_a, _preprocess_b, and _postprocess as in Option 1


# When logging the orchestrator model:
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        "composed_model",
        python_model=ComposedModel(),
        artifacts={
            "model_a": "models:/model_a_name/Production",
            "model_b": "models:/model_b_name/Production"
        }
    )
This removes the network overhead but couples the models into one deployment artifact. It is a good middle ground if latency is critical but you still want cleaner code structure than today.
MOVING PREPROCESSING AND S3 DATA LOADING OUT OF SERVING
This is a strong instinct. For the S3 data loading in particular, consider these options:
1. Databricks Online Tables / Feature Serving: If the S3 data represents features that can be precomputed, publish them to a Feature Engineering in Unity Catalog table backed by an online table. Your serving endpoint can then do a feature lookup at inference time instead of pulling raw files from S3. This is the most "Databricks-native" approach.
https://docs.databricks.com/en/machine-learning/feature-store/index.html
2. Precompute and cache: If the data from S3 is relatively static, load it during load_context() rather than on every predict() call. The load_context() method runs once when the endpoint starts up, so it is the right place for heavy initialization.
3. Feature Serving endpoint: Databricks also offers standalone Feature Serving endpoints that serve precomputed features with low latency. Your orchestrator model can call a Feature Serving endpoint just like it calls Model A or Model B.
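The precompute-and-cache idea (option 2 above) can be sketched in isolation. This is a minimal, illustrative version: the class mirrors the mlflow.pyfunc.PythonModel interface but is shown standalone, the "reference_data" artifact key and "id" join key are hypothetical, and a local JSON file stands in for the S3 download you would do with boto3:

```python
import json


class CachedDataModel:
    # Sketch of the load_context() caching pattern. In the real endpoint this
    # class would subclass mlflow.pyfunc.PythonModel; it is shown standalone
    # here so the pattern is easy to follow.

    def load_context(self, context):
        # Runs once at endpoint startup -- the right place for heavy I/O.
        # Here we read a local JSON file; on Databricks this could instead be
        # a boto3 download from S3 ("reference_data" is a hypothetical key).
        with open(context.artifacts["reference_data"]) as f:
            self.reference = {row["id"]: row for row in json.load(f)}

    def predict(self, context, model_input):
        # Enrich each incoming record from the in-memory lookup -- no S3
        # access happens per request.
        return [{**rec, **self.reference.get(rec["id"], {})} for rec in model_input]
```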
INTERNAL ENDPOINT RESOLUTION
To answer your specific question about internal vs. public endpoints: within Mosaic AI Model Serving, your custom model code makes HTTP calls to the Databricks REST API (the /serving-endpoints/{name}/invocations path). These calls go through the Databricks control plane, not over the public internet, as long as you are calling endpoints within the same workspace. You authenticate with a Databricks token (stored as a secret), and the request is routed internally.
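A small pair of helpers (names are my own, purely illustrative) makes that routing explicit and is also a convenient place to add the timeout and retry behavior that inter-endpoint calls should generally have:

```python
import requests
from requests.adapters import HTTPAdapter, Retry


def invocation_url(host, endpoint_name):
    # Same-workspace endpoints are reached via this path; the request is
    # routed through the Databricks control plane, not the public internet.
    return f"{host.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"


def make_session(retries=3, backoff=0.5):
    # Retry transient 429/5xx responses between endpoints; tune the retry
    # budget and backoff to your latency SLOs.
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session
```

You would then call `session.post(invocation_url(host, name), ..., timeout=30)` from `_call_endpoint` so a slow downstream endpoint cannot hang the orchestrator indefinitely.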
RECOMMENDED ARCHITECTURE
Based on your constraints, here is the architecture I would suggest:
1. Deploy Model A as its own serving endpoint (registered in MLflow / Unity Catalog)
2. Deploy Model B as its own serving endpoint
3. Move S3 data loading into either:
- A precompute job that writes to an online table (feature store)
- The load_context() of the orchestrator if the data is static
4. Deploy an Orchestrator PyFunc model that:
- Accepts the request
- Looks up features from the online table (or receives them in the request)
- Calls Model A endpoint
- Transforms output and calls Model B endpoint
- Returns the final response
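With that architecture, a caller only needs the orchestrator endpoint. As a sketch, here is one way to build the "dataframe_split" request body that MLflow scoring servers accept from a list of records (the feature names are placeholders):

```python
def to_dataframe_split(records):
    # Convert a list of dicts into the "dataframe_split" payload shape that
    # MLflow scoring servers accept: {"columns": [...], "data": [[...], ...]}.
    columns = list(records[0].keys())
    return {
        "dataframe_split": {
            "columns": columns,
            "data": [[rec[col] for col in columns] for rec in records],
        }
    }


# Example request body for the orchestrator endpoint ("customer_id" and
# "amount" are placeholder feature names):
payload = to_dataframe_split([
    {"customer_id": 123, "amount": 42.0},
    {"customer_id": 456, "amount": 7.5},
])
```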
USEFUL DOCUMENTATION LINKS
- Mosaic AI Model Serving overview: https://docs.databricks.com/en/machine-learning/model-serving/index.html
- Custom model serving: https://docs.databricks.com/en/machine-learning/model-serving/custom-models.html
- Creating and managing serving endpoints: https://docs.databricks.com/en/machine-learning/model-serving/create-manage-serving-endpoints.html
- Feature Engineering in Unity Catalog: https://docs.databricks.com/en/machine-learning/feature-store/index.html
- MLflow PythonModel API: https://mlflow.org/docs/latest/python_api/mlflow.pyfunc.html
* This reply was drafted with an agent system I built, which researches responses using the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the reply when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.