<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Databricks Model Serving Scaling Logic in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150675#M4579</link>
    <description>Databricks Community thread on Model Serving autoscaling and request queuing for CPU-bound PDF extraction workloads.</description>
    <pubDate>Thu, 12 Mar 2026 11:11:28 GMT</pubDate>
    <dc:creator>fede_bia</dc:creator>
    <dc:date>2026-03-12T11:11:28Z</dc:date>
    <item>
      <title>Databricks Model Serving Scaling Logic</title>
      <link>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150675#M4579</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m seeking technical clarification on how Databricks Model Serving handles request queuing and autoscaling for CPU-intensive tasks. I am deploying a custom model for text and image extraction from PDFs (using Tesseract), and I’m struggling to ensure stability during parallel calls.&lt;/P&gt;&lt;P&gt;When I send 10 concurrent requests, I expect Databricks to scale out automatically to handle the parallel load. Instead, I see the following behavior:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Multiple requests are assigned to the same worker ID simultaneously.&lt;/LI&gt;&lt;LI&gt;The execution time for a standard 2-page PDF jumps from 8s to over 160s, leading to 504 Upstream Request Timeouts.&lt;/LI&gt;&lt;LI&gt;If one of the parallel calls involves a larger PDF, it blocks the worker even longer, causing a "pile-up" effect for all subsequent or concurrent requests assigned to that same worker.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I've tried different Compute Scale-out configurations (Small, Medium, Large, Custom). With Custom, even at a high minimum concurrency of 100, the 10 parallel calls sometimes complete, but other times they don't.&lt;/P&gt;&lt;P&gt;So, these are my questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Why does the underlying environment queue requests onto an existing worker instead of spinning up new replicas immediately when the CPU load spikes? I thought the system would scale automatically based on the incoming load.&lt;/LI&gt;&lt;LI&gt;How should I configure the endpoint to guarantee that parallel calls are distributed across different replicas to avoid interference?&lt;/LI&gt;&lt;LI&gt;Given that execution time depends on file size, is there a way to prevent a "heavy" request from bottlenecking other "lighter" requests sent at the same time?&lt;/LI&gt;&lt;LI&gt;Is there a way to limit a worker's internal buffer so it refuses or redirects a request when it is already at maximum capacity?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I want to understand the under-the-hood mechanism of Databricks Model Serving so that my extraction service remains reliable regardless of the number of parallel users or file sizes.&lt;/P&gt;&lt;P&gt;Any advice on the best scaling strategy or advanced configuration for this use case would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Federica&lt;/P&gt;</description>
      <pubDate>Thu, 12 Mar 2026 11:11:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150675#M4579</guid>
      <dc:creator>fede_bia</dc:creator>
      <dc:date>2026-03-12T11:11:28Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Model Serving Scaling Logic</title>
      <link>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150688#M4580</link>
      <description>&lt;P&gt;&lt;STRONG&gt;TL;DR:&lt;/STRONG&gt; Pre-provision &lt;CODE&gt;min_provisioned_concurrency&lt;/CODE&gt; ≥ your peak parallel requests (in multiples of 4) with scale-to-zero disabled, and chunk large PDFs in your model code to bound per-request latency. Reactive autoscaling can't help CPU-bound workloads that spike faster than new replicas can warm up.&lt;/P&gt;
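&lt;P&gt;For concreteness, here's a minimal sketch of that endpoint config via the REST API. The endpoint name, entity name, and version are placeholders, and the exact concurrency field names should be checked against the docs linked below:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import os
import requests

# Sketch: pin enough provisioned concurrency up front so capacity already
# exists when the burst of parallel requests arrives (names are placeholders).
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

config = {
    "served_entities": [
        {
            "entity_name": "main.default.pdf_extractor",  # placeholder model
            "entity_version": "3",                         # placeholder version
            "scale_to_zero_enabled": False,  # no cold starts
            # Provisioned concurrency moves in multiples of 4, so cover
            # 10 parallel requests with 12.
            "min_provisioned_concurrency": 12,
            "max_provisioned_concurrency": 24,
        }
    ]
}

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/pdf-extractor/config",
    headers={"Authorization": f"Bearer {token}"},
    json=config,
)
resp.raise_for_status()
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With scale-to-zero disabled and the minimum pinned above your peak, the 10 parallel calls land on warm capacity instead of queuing behind a single replica.&lt;/P&gt;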
&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/model-serving/production-optimization" target="_blank"&gt;https://docs.databricks.com/aws/en/machine-learning/model-serving/production-optimization&lt;/A&gt;&lt;/P&gt;
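&lt;P&gt;And a sketch of the chunking idea inside the model itself, so one large PDF becomes several short requests instead of a single multi-minute call. The pdf2image/pytesseract usage and the input schema are assumptions based on the Tesseract setup described in the question:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import mlflow.pyfunc


class ChunkedPdfExtractor(mlflow.pyfunc.PythonModel):
    # Each call OCRs at most PAGES_PER_CALL pages and tells the caller
    # where to resume, which bounds per-request latency regardless of
    # the file's total size.
    PAGES_PER_CALL = 4

    def predict(self, context, model_input):
        from pdf2image import convert_from_path
        import pytesseract

        path = model_input["pdf_path"][0]
        start = int(model_input.get("start_page", [1])[0])
        last = start + self.PAGES_PER_CALL - 1

        # Render and OCR only this page window.
        pages = convert_from_path(path, dpi=200, first_page=start, last_page=last)
        text = "\n\n".join(pytesseract.image_to_string(p) for p in pages)

        # A full window means there may be more pages; the client loops
        # until next_page comes back as None.
        next_page = last + 1 if len(pages) == self.PAGES_PER_CALL else None
        return {"text": text, "next_page": next_page}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The client then drives the loop, so a 50-page file turns into short, schedulable requests that interleave fairly with everyone else's 2-page files.&lt;/P&gt;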
</description>
      <pubDate>Thu, 12 Mar 2026 12:51:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150688#M4580</guid>
      <dc:creator>AbhaySingh</dc:creator>
      <dc:date>2026-03-12T12:51:06Z</dc:date>
    </item>
  </channel>
</rss>