<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Databricks Model Serving Scaling Logic in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150675#M4579</link>
    <description>Databricks Community thread on Model Serving autoscaling and request queuing for CPU-bound PDF extraction workloads.</description>
    <pubDate>Thu, 12 Mar 2026 11:11:28 GMT</pubDate>
    <dc:creator>fede_bia</dc:creator>
    <dc:date>2026-03-12T11:11:28Z</dc:date>
    <item>
      <title>Databricks Model Serving Scaling Logic</title>
      <link>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150675#M4579</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m seeking technical clarification on how Databricks Model Serving handles request queuing and autoscaling for CPU-intensive tasks. I am deploying a custom model for text and image extraction from PDFs (using Tesseract), and I’m struggling to ensure stability during parallel calls.&lt;/P&gt;&lt;P&gt;When I send 10 concurrent requests, I expect Databricks to scale out automatically to handle the parallel load. Instead, I see the following behavior:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Multiple requests are assigned to the same worker ID simultaneously.&lt;/LI&gt;&lt;LI&gt;The execution time for a standard 2-page PDF jumps from 8s to over 160s, leading to 504 Upstream Request Timeouts.&lt;/LI&gt;&lt;LI&gt;If one of the parallel calls involves a larger PDF, it blocks the worker even longer, causing a "pile-up" effect for all subsequent or concurrent requests assigned to that same worker.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I've tried different Compute Scale-out configurations (Small, Medium, Large, Custom). With Custom, even at a high minimum concurrency of 100, the 10 parallel calls sometimes complete, but other times they don't.&lt;/P&gt;&lt;P&gt;So, these are my questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Why does the underlying environment queue requests onto an existing worker instead of spinning up new replicas immediately when the CPU load spikes? I thought the system would scale automatically based on the incoming load.&lt;/LI&gt;&lt;LI&gt;How should I configure the endpoint to guarantee that parallel calls are distributed across different replicas to avoid interference?&lt;/LI&gt;&lt;LI&gt;Given that execution time depends on file size, is there a way to prevent a "heavy" request from bottlenecking other "lighter" requests sent at the same time?&lt;/LI&gt;&lt;LI&gt;Is there a way to limit a worker's internal buffer so it refuses or redirects a request when it is already at maximum capacity?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;I want to understand the under-the-hood mechanism of Databricks Model Serving so that my extraction service remains reliable regardless of the number of parallel users or file sizes.&lt;/P&gt;&lt;P&gt;Any advice on the best scaling strategy or advanced configuration for this use case would be greatly appreciated!&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Federica&lt;/P&gt;</description>
      <pubDate>Thu, 12 Mar 2026 11:11:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150675#M4579</guid>
      <dc:creator>fede_bia</dc:creator>
      <dc:date>2026-03-12T11:11:28Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks Model Serving Scaling Logic</title>
      <link>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150688#M4580</link>
      <description>&lt;P&gt;&lt;STRONG&gt;TL;DR:&lt;/STRONG&gt; Pre-provision &lt;CODE&gt;min_provisioned_concurrency&lt;/CODE&gt; ≥ your peak parallel requests (in multiples of 4) with scale-to-zero disabled, and chunk large PDFs in your model code to bound per-request latency. Reactive autoscaling can't help CPU-bound workloads that spike faster than new replicas can warm up.&lt;/P&gt;
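&lt;P&gt;For concreteness, here's a minimal sketch of that endpoint config via the REST API. The endpoint name, entity name, and version are placeholders, and the exact concurrency field names should be checked against the docs linked below:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import os
import requests

# Sketch: pin enough provisioned concurrency up front so capacity already
# exists when the burst of parallel requests arrives (names are placeholders).
host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

config = {
    "served_entities": [
        {
            "entity_name": "main.default.pdf_extractor",  # placeholder model
            "entity_version": "3",                         # placeholder version
            "scale_to_zero_enabled": False,  # no cold starts
            # Provisioned concurrency moves in multiples of 4, so cover
            # 10 parallel requests with 12.
            "min_provisioned_concurrency": 12,
            "max_provisioned_concurrency": 24,
        }
    ]
}

resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/pdf-extractor/config",
    headers={"Authorization": f"Bearer {token}"},
    json=config,
)
resp.raise_for_status()
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With scale-to-zero disabled and the minimum pinned above your peak, the 10 parallel calls land on warm capacity instead of queuing behind a single replica.&lt;/P&gt;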
&lt;P&gt;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/model-serving/production-optimization" target="_blank"&gt;https://docs.databricks.com/aws/en/machine-learning/model-serving/production-optimization&lt;/A&gt;&lt;/P&gt;
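&lt;P&gt;And a sketch of the chunking idea inside the model itself, so one large PDF becomes several short requests instead of a single multi-minute call. The pdf2image/pytesseract usage and the input schema are assumptions based on the Tesseract setup described in the question:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import mlflow.pyfunc


class ChunkedPdfExtractor(mlflow.pyfunc.PythonModel):
    # Each call OCRs at most PAGES_PER_CALL pages and tells the caller
    # where to resume, which bounds per-request latency regardless of
    # the file's total size.
    PAGES_PER_CALL = 4

    def predict(self, context, model_input):
        from pdf2image import convert_from_path
        import pytesseract

        path = model_input["pdf_path"][0]
        start = int(model_input.get("start_page", [1])[0])
        last = start + self.PAGES_PER_CALL - 1

        # Render and OCR only this page window.
        pages = convert_from_path(path, dpi=200, first_page=start, last_page=last)
        text = "\n\n".join(pytesseract.image_to_string(p) for p in pages)

        # A full window means there may be more pages; the client loops
        # until next_page comes back as None.
        next_page = last + 1 if len(pages) == self.PAGES_PER_CALL else None
        return {"text": text, "next_page": next_page}
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The client then drives the loop, so a 50-page file turns into short, schedulable requests that interleave fairly with everyone else's 2-page files.&lt;/P&gt;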
</description>
      <pubDate>Thu, 12 Mar 2026 12:51:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/databricks-model-serving-scaling-logic/m-p/150688#M4580</guid>
      <dc:creator>AbhaySingh</dc:creator>
      <dc:date>2026-03-12T12:51:06Z</dc:date>
    </item>
  </channel>
</rss>