<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to Optimize Batch Inference for Per-Item ML Models in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122701#M46843</link>
    <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Databricks offers unified capabilities for both real-time and batch inference across traditional ML models and large language models (LLMs) using Mosaic AI Model Serving and AI Functions (notably the &lt;CODE&gt;ai_query&lt;/CODE&gt; function). For your use case (n items, n models, requiring batch inference), several approaches are possible; their efficiency and cost depend on the tools and orchestration you choose.&lt;/P&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9hb y728l9aj heading3 _1jeaq5e1"&gt;Best Practices for Batch Inference with Multiple ML Models&lt;/H3&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;1. Use of Mosaic AI Model Serving + ai_query (Batch Inference):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Mosaic AI Model Serving is recommended for both real-time and batch inference. It enables you to deploy classical ML models, LLMs, or custom/fine-tuned models as managed endpoints.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;The batch inference solution, particularly through the &lt;CODE&gt;ai_query&lt;/CODE&gt; SQL function, is designed for large-scale, high-throughput inference. It allows you to apply any supported model (including those hosted externally, with some caveats) directly against governed data in Unity Catalog, without data movement, and is tightly integrated with Databricks' orchestration workflows and governance.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;For your scenario with multiple models, you can invoke batch inference for each model within the same pipeline—leveraging parallelism across the platform. This approach replaces slow, sequential Python loops with efficient, parallel SQL or workflow operations&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;2. Model Management and Cost Control:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Endpoints are provisioned as-needed for batch jobs and can be automatically deleted or scaled down to avoid unnecessary charges.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;Throughput for batch inference is significantly higher than real-time provisioned endpoints, resulting in better price/performance when running inference over many samples or models.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;All batch inference requests are logged to provide observability and facilitate cost/usage tracking. Default throughput limits and built-in governance features help control and predict spend&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;3. When to Use Model Serving vs. Alternatives (UDFs or Manual Spark Jobs):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Traditionally, users created UDFs (e.g., loading each model with &lt;CODE&gt;mlflow.load_model&lt;/CODE&gt; and applying them using a Spark UDF in a loop). While technically viable, these approaches are slower and less scalable, especially for LLMs and GPU-backed models, since batch inference via Model Serving leverages high-throughput hardware and optimized compute scheduling&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Tue, 24 Jun 2025 14:24:41 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2025-06-24T14:24:41Z</dc:date>
    <item>
      <title>How to Optimize Batch Inference for Per-Item ML Models in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122647#M46835</link>
      <description>&lt;P&gt;Hi everyone, I’m relatively new to Databricks. I worked with it a few months ago, and today I encountered an issue in our system. Basically, we have multiple ML models — one for each item — and we want to run inference in a more efficient way, ideally in batch mode, instead of looping through each model sequentially. We have n items with n corresponding ML models. What would be a smart and efficient way to perform inference for all items? Is it recommended to use Model Serving and create an endpoint with Mosaic AI, or would that be unnecessarily expensive or overkill for our use case? Currently, our pipelines call the relevant ML model and run inference on a single sample record. How can we speed this up? Any advice or best practices would be greatly appreciated! Thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jun 2025 10:02:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122647#M46835</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-06-24T10:02:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Batch Inference for Per-Item ML Models in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122701#M46843</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Databricks offers unified capabilities for both real-time and batch inference across traditional ML models and large language models (LLMs) using Mosaic AI Model Serving and AI Functions (notably the &lt;CODE&gt;ai_query&lt;/CODE&gt; function). For your use case (n items, n models, requiring batch inference), several approaches are possible; their efficiency and cost depend on the tools and orchestration you choose.&lt;/P&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9hb y728l9aj heading3 _1jeaq5e1"&gt;Best Practices for Batch Inference with Multiple ML Models&lt;/H3&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;1. Use of Mosaic AI Model Serving + ai_query (Batch Inference):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Mosaic AI Model Serving is recommended for both real-time and batch inference. It enables you to deploy classical ML models, LLMs, or custom/fine-tuned models as managed endpoints.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;The batch inference solution, particularly through the &lt;CODE&gt;ai_query&lt;/CODE&gt; SQL function, is designed for large-scale, high-throughput inference. It allows you to apply any supported model (including those hosted externally, with some caveats) directly against governed data in Unity Catalog, without data movement, and is tightly integrated with Databricks' orchestration workflows and governance.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;For your scenario with multiple models, you can invoke batch inference for each model within the same pipeline—leveraging parallelism across the platform. This approach replaces slow, sequential Python loops with efficient, parallel SQL or workflow operations&lt;/LI&gt;
&lt;/UL&gt;
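As a concrete illustration of the per-model fan-out described above, one batch ai_query statement can be generated per item/model pair and each run as its own parallel SQL or workflow task. A minimal sketch, in which the table, column, and endpoint names are hypothetical placeholders, not details from this thread:

```python
# Sketch: build one ai_query batch-inference statement per (item, endpoint)
# pair. Table, column, and endpoint names are hypothetical placeholders.

def build_ai_query_sql(table: str, endpoint: str, item_id: str,
                       feature_col: str = "features") -> str:
    """Return a batch-inference statement using the ai_query SQL function."""
    return (
        f"SELECT *, ai_query('{endpoint}', {feature_col}) AS prediction "
        f"FROM {table} WHERE item_id = '{item_id}'"
    )

def build_per_model_queries(table: str, item_endpoints: dict) -> dict:
    """One statement per item-specific model; each can run as its own task."""
    return {
        item: build_ai_query_sql(table, endpoint, item)
        for item, endpoint in item_endpoints.items()
    }

queries = build_per_model_queries(
    "catalog.schema.items",
    {"item_a": "item_a_endpoint", "item_b": "item_b_endpoint"},
)
```

Only the statement construction is shown here; on Databricks, each generated statement could be submitted via spark.sql or as a separate workflow task to obtain the parallelism described above.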
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;2. Model Management and Cost Control:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Endpoints are provisioned as-needed for batch jobs and can be automatically deleted or scaled down to avoid unnecessary charges.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;Throughput for batch inference is significantly higher than real-time provisioned endpoints, resulting in better price/performance when running inference over many samples or models.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;All batch inference requests are logged to provide observability and facilitate cost/usage tracking. Default throughput limits and built-in governance features help control and predict spend&lt;/LI&gt;
&lt;/UL&gt;
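The cost-control levers above (on-demand provisioning and scaling down when idle) map to fields in the serving-endpoint configuration. A hedged sketch of an endpoint-creation payload in the general shape used by the Databricks serving-endpoints REST API; the endpoint name, model path, and version here are hypothetical:

```python
# Sketch of a serving-endpoint configuration with scale-to-zero enabled, in
# the general shape of the Databricks serving-endpoints REST API payload.
# Endpoint, model, and version values are hypothetical placeholders.

def make_endpoint_config(endpoint_name: str, model_name: str,
                         model_version: str) -> dict:
    """Return an endpoint-creation payload that scales to zero when idle."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": model_name,   # e.g. a Unity Catalog model
                    "entity_version": model_version,
                    "workload_size": "Small",
                    # Lets the endpoint scale down to zero between batch runs
                    # so it does not accrue charges while idle.
                    "scale_to_zero_enabled": True,
                }
            ]
        },
    }

config = make_endpoint_config("item_a_endpoint",
                              "catalog.schema.item_a_model", "1")
```

A payload like this could be submitted through the serving-endpoints REST API or the Databricks SDK, and the endpoint deleted once the batch job completes; check the exact field names against the current API reference.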
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;3. When to Use Model Serving vs. Alternatives (UDFs or Manual Spark Jobs):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Traditionally, users created UDFs (e.g., loading each model with &lt;CODE&gt;mlflow.load_model&lt;/CODE&gt; and applying them using a Spark UDF in a loop). While technically viable, these approaches are slower and less scalable, especially for LLMs and GPU-backed models, since batch inference via Model Serving leverages high-throughput hardware and optimized compute scheduling&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 24 Jun 2025 14:24:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122701#M46843</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-06-24T14:24:41Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Batch Inference for Per-Item ML Models in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122705#M46844</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Hi,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Thank you for the detailed answer.&lt;BR /&gt;Could you please provide an example notebook that shows how to set up model serving through Mosaic AI using a single endpoint, and how to implement a dispatcher (or routing) pattern to call the appropriate model directly via that endpoint?&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jun 2025 14:43:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122705#M46844</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-06-24T14:43:32Z</dc:date>
    </item>
  </channel>
</rss>

