Hi everyone,
I have a question about the concurrency limits I should expect when streaming responses from an LLM chain served via Databricks Model Serving.
When using a streaming response, the request remains open for the duration of the generation process. For example, in a RAG pipeline with streaming enabled, it might take 30-45 seconds to complete a single response. Given that the largest Databricks Model Serving compute tier supports up to 64 concurrent requests, does this mean that streaming significantly limits the overall throughput?
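For context, here is roughly how I'm consuming the stream. This is just a simplified sketch using the OpenAI-compatible client against a serving endpoint; the workspace host, token, and endpoint name are placeholders:

```python
from openai import OpenAI

# Placeholders -- substitute your own workspace host, PAT, and endpoint name.
client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-host>/serving-endpoints",
)

# The HTTP connection stays open until the final chunk arrives, so the
# request occupies the endpoint for the full 30-45 s generation time.
stream = client.chat.completions.create(
    model="<my-rag-chain-endpoint>",
    messages=[{"role": "user", "content": "What does our refund policy say?"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```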
For instance, if each request takes 30-45 seconds, wouldn't that effectively cap the number of requests the endpoint can handle per minute at a fairly low number? Or am I misunderstanding how Databricks handles concurrency for streaming responses?
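Here's the back-of-envelope math behind that worry, assuming each open streaming request occupies one concurrency slot for its entire generation time (that assumption is exactly what I'd like someone to confirm or correct):

```python
# Rough throughput ceiling if every streaming request holds a
# concurrency slot until the last token is generated.
max_concurrency = 64           # largest Model Serving compute tier
generation_times_s = [30, 45]  # observed end-to-end time per response

for t in generation_times_s:
    ceiling_per_minute = max_concurrency * 60 / t
    print(f"{t}s per request -> ~{ceiling_per_minute:.0f} requests/minute")

# 30s per request -> ~128 requests/minute
# 45s per request -> ~85 requests/minute
```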
For reference, I’m very happy with the performance of Model Serving for traditional ML models, but I’m specifically evaluating its viability for LLM-based applications.
Would appreciate any insights or best practices on handling concurrency with streaming!
Thanks!