<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to Optimize Batch Inference for Per-Item ML Models in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122701#M46843</link>
    <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Databricks offers unified capabilities for both real-time and batch inference across traditional ML models and large language models (LLMs) using Mosaic AI Model Serving and AI Functions (notably the &lt;CODE&gt;ai_query&lt;/CODE&gt; function). For your use case (n items, n models, requiring batch inference), several approaches are possible; their efficiency and cost depend on the tools and orchestration you choose.&lt;/P&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9hb y728l9aj heading3 _1jeaq5e1"&gt;Best Practices for Batch Inference with Multiple ML Models&lt;/H3&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;1. Use of Mosaic AI Model Serving + ai_query (Batch Inference):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Mosaic AI Model Serving is recommended for both real-time and batch inference. It enables you to deploy classical ML models, LLMs, or custom/fine-tuned models as managed endpoints.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;The batch inference solution, particularly through the &lt;CODE&gt;ai_query&lt;/CODE&gt; SQL function, is designed for large-scale, high-throughput inference. It allows you to apply any supported model (including those hosted externally, with some caveats) directly against governed data in Unity Catalog, without data movement, and is tightly integrated with Databricks' orchestration workflows and governance.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;For your scenario with multiple models, you can invoke batch inference for each model within the same pipeline—leveraging parallelism across the platform. This approach replaces slow, sequential Python loops with efficient, parallel SQL or workflow operations&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;2. Model Management and Cost Control:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Endpoints are provisioned as-needed for batch jobs and can be automatically deleted or scaled down to avoid unnecessary charges.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;Throughput for batch inference is significantly higher than real-time provisioned endpoints, resulting in better price/performance when running inference over many samples or models.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;All batch inference requests are logged to provide observability and facilitate cost/usage tracking. Default throughput limits and built-in governance features help control and predict spend&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;3. When to Use Model Serving vs. Alternatives (UDFs or Manual Spark Jobs):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Traditionally, users created UDFs (e.g., loading each model with &lt;CODE&gt;mlflow.load_model&lt;/CODE&gt; and applying them using a Spark UDF in a loop). While technically viable, these approaches are slower and less scalable, especially for LLMs and GPU-backed models, since batch inference via Model Serving leverages high-throughput hardware and optimized compute scheduling&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Tue, 24 Jun 2025 14:24:41 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2025-06-24T14:24:41Z</dc:date>
    <item>
      <title>How to Optimize Batch Inference for Per-Item ML Models in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122647#M46835</link>
      <description>&lt;P&gt;Hi everyone, I’m relatively new to Databricks. I worked with it a few months ago, and today I encountered an issue in our system. Basically, we have multiple ML models — one for each item — and we want to run inference in a more efficient way, ideally in batch mode, instead of looping through each model sequentially. We have n items with n corresponding ML models. What would be a smart and efficient way to perform inference for all items? Is it recommended to use Model Serving and create an endpoint with Mosaic AI, or would that be unnecessarily expensive or overkill for our use case? Currently, our pipelines call the relevant ML model and run inference on a single sample record. How can we speed this up? Any advice or best practices would be greatly appreciated! Thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jun 2025 10:02:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122647#M46835</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-06-24T10:02:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Batch Inference for Per-Item ML Models in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122701#M46843</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Databricks offers unified capabilities for both real-time and batch inference across traditional ML models and large language models (LLMs) using Mosaic AI Model Serving and AI Functions (notably the &lt;CODE&gt;ai_query&lt;/CODE&gt; function). For your use case (n items, n models, requiring batch inference), several approaches are possible; their efficiency and cost depend on the tools and orchestration you choose.&lt;/P&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9hb y728l9aj heading3 _1jeaq5e1"&gt;Best Practices for Batch Inference with Multiple ML Models&lt;/H3&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;1. Use of Mosaic AI Model Serving + ai_query (Batch Inference):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Mosaic AI Model Serving is recommended for both real-time and batch inference. It enables you to deploy classical ML models, LLMs, or custom/fine-tuned models as managed endpoints.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;The batch inference solution, particularly through the &lt;CODE&gt;ai_query&lt;/CODE&gt; SQL function, is designed for large-scale, high-throughput inference. It allows you to apply any supported model (including those hosted externally, with some caveats) directly against governed data in Unity Catalog, without data movement, and is tightly integrated with Databricks' orchestration workflows and governance.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;For your scenario with multiple models, you can invoke batch inference for each model within the same pipeline—leveraging parallelism across the platform. This approach replaces slow, sequential Python loops with efficient, parallel SQL or workflow operations&lt;/LI&gt;
&lt;/UL&gt;
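As a concrete illustration of the per-model fan-out described above, one batch ai_query statement can be generated per item/model pair and each run as its own parallel SQL or workflow task. A minimal sketch, in which the table, column, and endpoint names are hypothetical placeholders, not details from this thread:

```python
# Sketch: build one ai_query batch-inference statement per (item, endpoint)
# pair. Table, column, and endpoint names are hypothetical placeholders.

def build_ai_query_sql(table: str, endpoint: str, item_id: str,
                       feature_col: str = "features") -> str:
    """Return a batch-inference statement using the ai_query SQL function."""
    return (
        f"SELECT *, ai_query('{endpoint}', {feature_col}) AS prediction "
        f"FROM {table} WHERE item_id = '{item_id}'"
    )

def build_per_model_queries(table: str, item_endpoints: dict) -> dict:
    """One statement per item-specific model; each can run as its own task."""
    return {
        item: build_ai_query_sql(table, endpoint, item)
        for item, endpoint in item_endpoints.items()
    }

queries = build_per_model_queries(
    "catalog.schema.items",
    {"item_a": "item_a_endpoint", "item_b": "item_b_endpoint"},
)
```

Only the statement construction is shown here; on Databricks, each generated statement could be submitted via spark.sql or as a separate workflow task to obtain the parallelism described above.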
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;2. Model Management and Cost Control:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Endpoints are provisioned as-needed for batch jobs and can be automatically deleted or scaled down to avoid unnecessary charges.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;Throughput for batch inference is significantly higher than real-time provisioned endpoints, resulting in better price/performance when running inference over many samples or models.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;All batch inference requests are logged to provide observability and facilitate cost/usage tracking. Default throughput limits and built-in governance features help control and predict spend&lt;/LI&gt;
&lt;/UL&gt;
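The cost-control levers above (on-demand provisioning and scaling down when idle) map to fields in the serving-endpoint configuration. A hedged sketch of an endpoint-creation payload in the general shape used by the Databricks serving-endpoints REST API; the endpoint name, model path, and version here are hypothetical:

```python
# Sketch of a serving-endpoint configuration with scale-to-zero enabled, in
# the general shape of the Databricks serving-endpoints REST API payload.
# Endpoint, model, and version values are hypothetical placeholders.

def make_endpoint_config(endpoint_name: str, model_name: str,
                         model_version: str) -> dict:
    """Return an endpoint-creation payload that scales to zero when idle."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": model_name,   # e.g. a Unity Catalog model
                    "entity_version": model_version,
                    "workload_size": "Small",
                    # Lets the endpoint scale down to zero between batch runs
                    # so it does not accrue charges while idle.
                    "scale_to_zero_enabled": True,
                }
            ]
        },
    }

config = make_endpoint_config("item_a_endpoint",
                              "catalog.schema.item_a_model", "1")
```

A payload like this could be submitted through the serving-endpoints REST API or the Databricks SDK, and the endpoint deleted once the batch job completes; check the exact field names against the current API reference.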
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;3. When to Use Model Serving vs. Alternatives (UDFs or Manual Spark Jobs):&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Traditionally, users created UDFs (e.g., loading each model with &lt;CODE&gt;mlflow.load_model&lt;/CODE&gt; and applying them using a Spark UDF in a loop). While technically viable, these approaches are slower and less scalable, especially for LLMs and GPU-backed models, since batch inference via Model Serving leverages high-throughput hardware and optimized compute scheduling&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 24 Jun 2025 14:24:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122701#M46843</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-06-24T14:24:41Z</dc:date>
    </item>
    <item>
      <title>Re: How to Optimize Batch Inference for Per-Item ML Models in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122705#M46844</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Hi,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Thank you for the detailed answer.&lt;BR /&gt;Could you please provide an example notebook that shows how to set up model serving through Mosaic AI using a single endpoint, and how to implement a dispatcher (or routing) pattern to call the appropriate model directly via that endpoint?&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jun 2025 14:43:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-optimize-batch-inference-for-per-item-ml-models-in/m-p/122705#M46844</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-06-24T14:43:32Z</dc:date>
    </item>
  </channel>
</rss>

