Hi everyone,
I'm seeking technical clarification on how Databricks Model Serving handles request queuing and autoscaling for CPU-intensive workloads. I'm deploying a custom model that extracts text and images from PDFs (using Tesseract), and I'm struggling to keep it stable under parallel calls.
When I send 10 concurrent requests, I expect Databricks to scale out automatically to handle the parallel load. Instead, I see the following behavior:
- Multiple requests are assigned to the same worker ID simultaneously.
- Execution time for a standard 2-page PDF jumps from ~8s to over 160s, leading to 504 Upstream Request Timeout errors.
- If one of the parallel calls involves a larger PDF, it blocks the worker even longer, causing a "pile-up" effect for all subsequent or concurrent requests assigned to that same worker.
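To make the pile-up concrete, here's a toy queueing calculation (not Databricks code; service times are assumed for illustration, with one 60s "heavy" PDF among 8s jobs) comparing one shared worker against one replica per request:

```python
# All requests arrive at t=0; service times in seconds are assumed values.
service_times = [8, 8, 60, 8, 8]

def completion_times_single_worker(jobs):
    """One worker processes jobs FIFO; each job waits for everything queued before it."""
    done, t = [], 0
    for s in jobs:
        t += s
        done.append(t)
    return done

def completion_times_one_replica_each(jobs):
    """With a dedicated replica per request, each job finishes after its own service time."""
    return list(jobs)

print(completion_times_single_worker(service_times))    # [8, 16, 76, 84, 92]
print(completion_times_one_replica_each(service_times)) # [8, 8, 60, 8, 8]
```

With a single worker, the two light requests behind the heavy one finish at 84s and 92s, which is exactly the 8s-to-160s-style degradation I'm seeing once several requests land on the same worker.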
I've tried different Compute Scale-out configurations (Small, Medium, Large, Custom). With Custom and a high minimum concurrency (even 100), the 10 parallel calls sometimes complete, but not reliably.
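For reference, this is roughly the payload I'm sending to `PUT /api/2.0/serving-endpoints/{name}/config` (a sketch only; `pdf_extractor_model` and the version are placeholders, and I may be misunderstanding how `workload_size` maps to provisioned concurrency):

```json
{
  "served_entities": [
    {
      "entity_name": "pdf_extractor_model",
      "entity_version": "3",
      "workload_type": "CPU",
      "workload_size": "Medium",
      "scale_to_zero_enabled": false
    }
  ]
}
```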
So, these are my questions:
- Why does the underlying environment prioritize queuing requests onto an existing worker instead of spinning up new replicas immediately when CPU load spikes? I expected the system to scale automatically with incoming load.
- How should I configure the endpoint to guarantee that parallel calls are distributed across different replicas to avoid interference?
- Given that execution time depends on file size, is there a way to prevent a "heavy" request from bottlenecking other "lighter" requests sent at the same time?
- Is there a way to limit a worker's internal buffer so it refuses or redirects a request when it's already processing at maximum capacity?
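Absent a server-side limit, the workaround I'm considering is throttling on the client side so the endpoint never sees more in-flight requests than it can absorb. A minimal sketch (the cap of 4 is an assumed capacity, and `call_endpoint` is a stand-in for the real HTTP invocation):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 4  # assumed per-endpoint capacity; tune to the real concurrency
gate = threading.Semaphore(MAX_IN_FLIGHT)

def call_endpoint(pdf_id):
    # Stand-in for POST .../serving-endpoints/<name>/invocations
    return f"processed {pdf_id}"

def throttled_call(pdf_id):
    with gate:  # blocks while MAX_IN_FLIGHT calls are already in flight
        return call_endpoint(pdf_id)

# 10 callers, but at most 4 requests reach the endpoint at once.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(throttled_call, range(10)))
print(results)
```

This caps interference between callers, but it doesn't solve the underlying question of why the platform doesn't scale out instead.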
I want to understand the under-the-hood mechanics of Databricks Model Serving so that my extraction service remains reliable regardless of the number of parallel users or the size of the files.
Any advice on the best scaling strategy or advanced configuration for this use case would be greatly appreciated!
Thanks,
Federica