Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

how to speed up inference?

jeremy98
Honored Contributor

Hi guys,

I'm new to this concept, but we have several ML models that share the same code structure. What I don't fully understand is how to handle different types of models efficiently. Right now I have to loop through my items to run inference with each model, and each one requires its own specific inference process.
How can I speed this up using some kind of batch inference step?

1 ACCEPTED SOLUTION


mark_ott
Databricks Employee

In Databricks, the most efficient way to handle multiple machine learning models for inference, especially when each model has its own inference logic, is to use batch inference with Spark DataFrames and Pandas UDFs. Instead of looping over your models sequentially in Python, you can parallelize inference across your data and model configurations using Spark's distributed capabilities.

Batch Inference with Spark DataFrames

Databricks recommends structuring your data in a Spark DataFrame, where each row represents an item for prediction and may include metadata indicating which model to use. The workflow typically includes:

  1. Loading your data into a Spark DataFrame (from Unity Catalog, Delta tables, or external sources).

  2. Loading your models from the MLflow Model Registry.

  3. Creating Spark UDFs for inference using mlflow.pyfunc.spark_udf().

  4. Applying the UDFs to your DataFrame to generate predictions in bulk.

Example:

python
import mlflow

# Load the registered model from the MLflow Model Registry as a Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/Production")

# Score every row in bulk and persist the predictions as a table
df = df.withColumn("prediction", predict_udf(*df.columns))
df.write.mode("overwrite").saveAsTable("predictions_output")
 

This approach allows Spark to distribute inference tasks across multiple executors, avoiding Python's sequential bottlenecks.

Parallel Multi-Model Inference

When dealing with multiple models (e.g., per client or product), Databricks supports parallel batch inference using groupBy().applyInPandas(), Spark's grouped pandas function API. Each Spark worker can handle inference for a different model, allowing you to:

  • Load each model once per worker process.

  • Process subsets of data in parallel.

Example pattern:

python
import mlflow
from pyspark.sql.types import StructType, StructField, DoubleType

def run_inference(pdf):
    # Each group holds the rows for one model; load that model once per group
    model = mlflow.pyfunc.load_model(pdf['model_path'].iloc[0])
    pdf['prediction'] = model.predict(pdf['features'])
    return pdf

# The output schema is the input schema plus the new "prediction" column
output_schema = StructType(df.schema.fields + [StructField("prediction", DoubleType())])
result_df = df.groupBy("model_id").applyInPandas(run_inference, schema=output_schema)

This design reduces redundant model loading and uses Spark's distributed compute layer efficiently.

Mosaic AI & AI Functions

If your inference needs involve standard ML or LLM models, you can simplify further with AI Functions or Mosaic AI batch inference. These let you run model inference directly via SQL using functions like ai_query() without manual looping or building pipelines.

Example SQL:

sql
-- ai_query() targets a Databricks Model Serving endpoint by name
SELECT
  input_text,
  ai_query('my_registered_model', input_text) AS prediction
FROM my_input_table

Summary of Best Practices

Technique                                | Use Case                   | Advantage
Spark UDFs (mlflow.pyfunc.spark_udf)     | Standard ML models         | Simplifies batch scoring
Pandas UDFs with groupBy.applyInPandas   | Many models (per group)    | Parallel per-model inference
Mosaic AI / AI Functions                 | LLMs or unified inference  | Simplified SQL-based scaling
Delta Live Tables (DLT)                  | Scheduled, repeatable jobs | Automates production batch runs
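
For the Delta Live Tables row, a minimal sketch of a scheduled batch-scoring pipeline might look like the following (the input table name is a placeholder, reusing the registered model from the earlier example):

python
import dlt
import mlflow

@dlt.table(name="predictions_output")
def predictions_output():
    # Load the registered model once per pipeline update and score the input table
    predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/my_model/Production")
    df = spark.read.table("my_input_table")
    return df.withColumn("prediction", predict_udf(*df.columns))

Scheduling this pipeline as a job then gives you repeatable, automated production batch runs.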
 
 

In short, replace your Python loops with Spark-level Pandas UDFs or Databricks batch inference functions. This takes advantage of cluster parallelism and avoids sequential execution, allowing all your model inferences to run efficiently in parallel across nodes.


2 REPLIES


NandiniN
Databricks Employee

Hi @jeremy98 

I have not tried this, but could using Python's multiprocessing library to assign the inference for different models to different CPU cores be something you would want to try?
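
As a rough, untested sketch of that idea (the model URIs and the batch_a / batch_b pandas DataFrames are hypothetical placeholders, and note this parallelizes on the driver node only, unlike the Spark-based approaches above):

python
from multiprocessing import Pool
import mlflow

# Hypothetical work items: one (model URI, pandas DataFrame) pair per model
tasks = [
    ("models:/model_a/Production", batch_a),
    ("models:/model_b/Production", batch_b),
]

def score(task):
    model_uri, batch = task
    model = mlflow.pyfunc.load_model(model_uri)  # each worker process loads its own model
    return model.predict(batch)                  # run that model's inference on its batch

with Pool(processes=len(tasks)) as pool:         # one worker per model, bounded by available cores
    results = pool.map(score, tasks)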

Also, here are some useful links - https://docs.databricks.com/en/machine-learning/model-serving/serve-multiple-models-to-serving-endpo... and https://www.databricks.com/blog/2022/07/20/parallel-ml-how-compass-built-a-framework-for-training-ma...

Thanks!