Databricks offers unified capabilities for both real-time and batch inference across traditional ML models and large language models (LLMs) using Mosaic AI Model Serving and AI Functions (notably the ai_query function). For your use case (n items, n models, requiring batch inference), several approaches are possible, and their efficiency and cost depend on the tools and orchestration you choose.
Best Practices for Batch Inference with Multiple ML Models
1. Use of Mosaic AI Model Serving + ai_query (Batch Inference):
- Mosaic AI Model Serving is recommended for both real-time and batch inference. It enables you to deploy classical ML models, LLMs, or custom/fine-tuned models as managed endpoints.
- The batch inference solution, particularly through the ai_query SQL function, is designed for large-scale, high-throughput inference. It lets you apply any supported model (including externally hosted models, with some caveats) directly against governed data in Unity Catalog, without data movement, and it is tightly integrated with Databricks orchestration workflows and governance.
- For your scenario with multiple models, you can invoke batch inference for each model within the same pipeline, leveraging parallelism across the platform. This approach replaces slow, sequential Python loops with efficient, parallel SQL or workflow operations (see the sketch below).
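As an illustration, the same pattern can be driven from a notebook with PySpark. This is a minimal sketch, assuming hypothetical endpoint names and a main.demo.items input table with an id and a text column; for a traditional ML endpoint, the second argument to ai_query would typically be a struct of named feature columns rather than a single text column.

```python
# Minimal sketch: run ai_query batch inference against several serving endpoints.
# Endpoint names, table names, and the `text` column are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

endpoints = ["model_a_endpoint", "model_b_endpoint", "model_c_endpoint"]

for endpoint in endpoints:
    # ai_query is evaluated by the platform in parallel over the input table,
    # so each model's predictions are materialized without a Python scoring loop.
    spark.sql(f"""
        CREATE OR REPLACE TABLE main.demo.predictions_{endpoint} AS
        SELECT
          id,
          ai_query('{endpoint}', text) AS prediction
        FROM main.demo.items
    """)
```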
2. Model Management and Cost Control:
- Endpoints are provisioned as needed for batch jobs and can be automatically deleted or scaled down to avoid unnecessary charges (see the provisioning sketch after this list).
- Throughput for batch inference is significantly higher than real-time provisioned endpoints, resulting in better price/performance when running inference over many samples or models.
- All batch inference requests are logged, providing observability and facilitating cost/usage tracking. Default throughput limits and built-in governance features help control and predict spend.
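If you do provision dedicated endpoints for a batch job, a scale-to-zero configuration keeps them from accruing charges while idle. The following is a hedged sketch using the Databricks Python SDK; the endpoint name, model name, and version are placeholders, and the exact config classes can vary between SDK versions.

```python
# Sketch (assumed names/versions): create a serving endpoint that scales to zero
# when idle, so a batch job only pays for compute while requests are in flight.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

w.serving_endpoints.create(
    name="batch-scoring-endpoint",  # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.demo.my_model",  # UC-registered model (placeholder)
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,  # avoid charges while the endpoint is idle
            )
        ]
    ),
)
```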
3. When to Use Model Serving vs. Alternatives (UDFs or Manual Spark Jobs):
- Traditionally, users created UDFs (e.g., loading each model with mlflow.pyfunc.load_model and applying it via a Spark UDF in a loop). While technically viable, this approach is slower and less scalable, especially for LLMs and GPU-backed models, because batch inference via Model Serving leverages high-throughput hardware and optimized compute scheduling. A sketch of the traditional pattern follows for contrast.
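For contrast, this is roughly what the UDF-based loop looks like; the model URIs, table names, and feature columns are placeholders. It works, but every model is loaded onto the job cluster and scored in sequence, rather than being served by optimized batch-inference infrastructure.

```python
# Sketch of the traditional Spark UDF scoring loop (placeholder URIs and columns).
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("main.demo.items")  # placeholder input table

model_uris = {
    "model_a": "models:/main.demo.model_a/1",
    "model_b": "models:/main.demo.model_b/1",
}

for name, uri in model_uris.items():
    # Each iteration loads one model onto the cluster and scores the full table;
    # this is the sequential pattern that Model Serving batch inference replaces.
    predict = mlflow.pyfunc.spark_udf(spark, model_uri=uri, result_type="double")
    df = df.withColumn(f"{name}_prediction", predict("feature1", "feature2"))

df.write.mode("overwrite").saveAsTable("main.demo.udf_predictions")
```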