Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Databricks Model Serving Scaling Logic

fede_bia
Databricks Partner

Hi everyone,

I’m seeking technical clarification on how Databricks Model Serving handles request queuing and autoscaling for CPU-intensive tasks. I am deploying a custom model for text and image extraction from PDFs (using Tesseract), and I’m struggling to ensure stability during parallel calls.

When I send 10 concurrent requests, I expect Databricks to scale out automatically to handle the parallel load. Instead, I see the following behavior:

  • Multiple requests are assigned to the same worker ID simultaneously.
  • Execution time for a standard 2-page PDF jumps from ~8s to over 160s, leading to 504 Upstream Request Timeout errors.
  • If one of the parallel calls involves a larger PDF, it blocks the worker even longer, causing a "pile-up" effect for all subsequent or concurrent requests assigned to that same worker.

I've tried the different Compute Scale-out configurations (Small, Medium, Large, Custom). With Custom and a high minimum concurrency (even 100), the 10 parallel calls sometimes all complete, but other times they don't.
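For context, my load test is just a thread pool firing all the requests at once. A rough sketch (here `call_endpoint` is a stand-in for the actual POST to the endpoint's invocations URL):

```python
import concurrent.futures
import time

def call_endpoint(payload):
    # Placeholder for the real scoring request to the serving
    # endpoint; here it just simulates a slow OCR call.
    time.sleep(0.1)
    return {"file": payload["file"], "status": "ok"}

payloads = [{"file": f"doc_{i}.pdf"} for i in range(10)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_endpoint, payloads))
elapsed = time.time() - start

print(f"{len(results)} responses in {elapsed:.1f}s")
```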

So, these are my questions:

  • Why does the underlying environment prioritize queuing requests into an existing worker instead of spinning up new replicas immediately when the CPU load spikes? I thought the system should scale automatically based on the incoming load.
  • How should I configure the endpoint to guarantee that parallel calls are distributed across different replicas to avoid interference?
  • Given that execution time depends on file size, is there a way to prevent a "heavy" request from bottlenecking other "lighter" requests sent at the same time?
  • Is there a way to limit the internal buffer of a worker so it refuses/redirects a request if it’s already processing its maximum capacity?

I want to understand the "under-the-hood" mechanism of Databricks Model Serving to ensure my extraction service remains reliable regardless of the number of parallel users or the file sizes.

Any advice on the best scaling strategy or advanced configuration for this use case would be greatly appreciated!

Thanks,

Federica

1 REPLY

AbhaySingh
Databricks Employee

TL;DR: Pre-provision min_provisioned_concurrency to at least your peak number of parallel requests (in multiples of 4) with scale-to-zero disabled, and chunk large PDFs in your model code to bound per-request latency. Reactive autoscaling can't help CPU-bound workloads that spike faster than new replicas can warm up.
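As a rough illustration, a config update for the serving-endpoints REST API might look like the sketch below. The endpoint name, entity name/version, and concurrency numbers are placeholders, and you should verify the exact field names against the current API reference before using this:

```python
import json

# Sketch of an endpoint config update. scale_to_zero_enabled=False keeps
# replicas warm; provisioned concurrency is set in multiples of 4 and at
# least as high as the expected number of parallel requests.
config = {
    "served_entities": [
        {
            "entity_name": "my_catalog.my_schema.pdf_extractor",  # placeholder
            "entity_version": "3",                                # placeholder
            "scale_to_zero_enabled": False,
            "min_provisioned_concurrency": 12,
            "max_provisioned_concurrency": 16,
        }
    ]
}

# Applied with something like:
# requests.put(f"{host}/api/2.0/serving-endpoints/<endpoint-name>/config",
#              headers={"Authorization": f"Bearer {token}"}, json=config)
print(json.dumps(config, indent=2))
```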

https://docs.databricks.com/aws/en/machine-learning/model-serving/production-optimization
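On the chunking point: inside your model's predict code you can split a large PDF into fixed-size page batches so no single unit of OCR work runs unbounded. Only the batching logic is sketched here; the page splitting itself would use whatever PDF library you already have in the model environment:

```python
def page_batches(num_pages, batch_size=4):
    """Yield (start, end) page ranges so each OCR unit of work is bounded."""
    for start in range(0, num_pages, batch_size):
        yield start, min(start + batch_size, num_pages)

# A 10-page PDF becomes three bounded units of work:
print(list(page_batches(10)))  # [(0, 4), (4, 8), (8, 10)]
```

Each batch then has a predictable worst-case latency, so one heavy file can no longer hold a worker for minutes at a time.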