3 weeks ago
Hi everyone,
I have a question regarding the concurrency limitations of streaming responses from an LLM chain via Databricks Model Serving.
When using a streaming response, the request remains open for the duration of the generation process. For example, in a RAG pipeline with streaming enabled, it might take 30-45 seconds to complete a single response. Given that the largest Databricks Model Serving compute tier supports up to 64 concurrent requests, does this mean that streaming significantly limits the overall throughput?
For instance, if each request takes 30-45 seconds, wouldn’t that effectively cap the number of requests the endpoint can handle per minute at a very low number? Or am I misunderstanding how Databricks handles concurrency in this context?
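To put rough numbers on my concern (purely back-of-the-envelope, using the 64 concurrent slots and the 30-45 s generation time from above):

```python
# Back-of-the-envelope throughput: a concurrency slot is held for the entire
# stream, so throughput is roughly slots / average generation time.
CONCURRENT_SLOTS = 64        # largest tier mentioned above
AVG_GENERATION_SECONDS = 40  # midpoint of the 30-45 s range

requests_per_minute = CONCURRENT_SLOTS * 60 / AVG_GENERATION_SECONDS
print(f"~{requests_per_minute:.0f} requests/minute at full saturation")  # ~96
```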
For reference, I’m very happy with the performance of Model Serving for traditional ML models, but I’m specifically evaluating its viability for LLM-based applications.
Would appreciate any insights or best practices on handling concurrency with streaming!
Thanks!
Labels:
- Generative AI
- Model Serving
Accepted Solutions
2 weeks ago
Great question! You’re absolutely right that streaming responses can keep requests open for the duration of the generation process, which may introduce concurrency limitations. That said, Databricks Model Serving provides several optimizations and best practices to help maximize throughput and efficiently handle concurrent streaming requests. Here are some suggestions:
Asynchronous Streaming and Request Batching
- Use asynchronous endpoints to avoid blocking resources for extended periods.
- Where possible, batch multiple queries into a single request to reduce the number of concurrent sessions.
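On the asynchronous point, here is a minimal client-side sketch, assuming an endpoint that accepts a chat-style payload and streams text back; the URL, token variable, and payload schema are placeholders to adapt to your model's signature. The server-side concurrency limit still applies, but this keeps the calling application from tying up a thread per open stream:

```python
import asyncio
import os

import httpx

# Placeholders: replace with your workspace host and endpoint name.
ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations"
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}


async def stream_one(client: httpx.AsyncClient, question: str) -> str:
    """Open one streaming request and collect the chunks as they arrive."""
    chunks = []
    # The payload schema is an assumption; match it to your model's signature.
    payload = {"messages": [{"role": "user", "content": question}], "stream": True}
    async with client.stream("POST", ENDPOINT_URL, headers=HEADERS,
                             json=payload, timeout=120) as response:
        async for chunk in response.aiter_text():
            chunks.append(chunk)
    return "".join(chunks)


async def main(questions: list[str]) -> list[str]:
    # A single async client multiplexes many open streams on the caller side.
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*(stream_one(client, q) for q in questions))


if __name__ == "__main__":
    print(asyncio.run(main(["What is RAG?", "How does streaming affect concurrency?"])))
```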
Dynamic Scaling of Compute Resources
- Leverage endpoint auto-scaling in Databricks to dynamically allocate resources based on traffic demand.
- If you’re using serverless Model Serving, consider increasing the serving capacity (e.g., moving to a larger workload size) to handle higher loads.
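For scaling, the endpoint configuration can be updated through the Serving Endpoints REST API. The sketch below is illustrative only; the entity name and version are placeholders, and you should verify the field names against the current API reference:

```python
import os

import requests

HOST = "https://<workspace-host>"        # placeholder
TOKEN = os.environ["DATABRICKS_TOKEN"]
ENDPOINT_NAME = "my-rag-endpoint"        # placeholder

# A larger workload size gives the endpoint more capacity to hold concurrent
# streams open at once.
resp = requests.put(
    f"{HOST}/api/2.0/serving-endpoints/{ENDPOINT_NAME}/config",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "served_entities": [
            {
                "entity_name": "catalog.schema.my_rag_model",  # placeholder
                "entity_version": "3",                         # placeholder
                "workload_size": "Large",         # Small/Medium/Large concurrency bands
                "scale_to_zero_enabled": False,   # keep capacity warm for latency-sensitive traffic
            }
        ]
    },
    timeout=60,
)
resp.raise_for_status()
```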
Optimize Token Generation for Faster Responses
- Fine-tune parameters like temperature, max tokens, and top-k sampling to balance response quality and generation speed.
- Use caching mechanisms, such as the Databricks Feature Store, to store and retrieve frequently used responses instead of regenerating them.
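The Feature Store is one option for caching; as a simpler stand-in, here is a sketch of an in-process exact-match cache combined with tighter generation settings. The payload schema is an assumption about your endpoint's signature:

```python
import functools
import os

import requests

ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations"  # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}


@functools.lru_cache(maxsize=1024)
def answer(question: str) -> str:
    """Exact repeats of a question skip the LLM call and free the slot immediately."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        # Tighter generation settings shorten each stream, so slots turn over faster.
        "max_tokens": 512,
        "temperature": 0.1,
    }
    resp = requests.post(ENDPOINT_URL, headers=HEADERS, json=payload, timeout=120)
    resp.raise_for_status()
    # The response shape depends on your model's signature; adjust the parsing as needed.
    return str(resp.json())
```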
Parallelize Requests and Use Multiple Endpoints
- For high-traffic scenarios, deploy multiple Model Serving endpoints and distribute requests using a load balancer.
- Partition workloads across multiple LLM instances to enable multi-threaded execution of response streaming.
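A minimal sketch of distributing requests, assuming two hypothetical endpoints and simple client-side round-robin; in production you would more likely put a proper load balancer or gateway in front:

```python
import itertools
import os
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders: two independently deployed serving endpoints for the same chain.
ENDPOINTS = [
    "https://<workspace-host>/serving-endpoints/rag-endpoint-a/invocations",
    "https://<workspace-host>/serving-endpoints/rag-endpoint-b/invocations",
]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}
_next_endpoint = itertools.cycle(ENDPOINTS)


def ask(question: str) -> dict:
    """Round-robin each request so open streams are spread across endpoints."""
    url = next(_next_endpoint)
    resp = requests.post(url, headers=HEADERS,
                         json={"messages": [{"role": "user", "content": question}]},
                         timeout=120)
    resp.raise_for_status()
    return resp.json()


# Fan a batch of questions out across both endpoints in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ask, ["q1", "q2", "q3", "q4"]))
```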
Hybrid Model Serving with Edge Processing
- Offload preprocessing tasks to edge devices or client-side applications to reduce latency before invoking the LLM.
- Precompute frequently requested embeddings or responses using the Databricks Feature Store for retrieval-augmented generation (RAG) workflows.
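As a rough illustration of pre-computation, run your chain offline for known high-frequency questions and land the results in a table that serving-time code can check first. This assumes a Databricks notebook or job context where `spark` is available; the table name and the offline helper are placeholders:

```python
# Runs offline (e.g., a scheduled Databricks job).
faq_questions = ["What is our refund policy?", "How do I reset my password?"]


def generate_answer_offline(question: str) -> str:
    """Stand-in for running your RAG chain in batch; replace with the real chain call."""
    return f"precomputed answer for: {question}"


rows = [(q, generate_answer_offline(q)) for q in faq_questions]
(spark.createDataFrame(rows, schema="question STRING, answer STRING")
      .write.mode("overwrite")
      .saveAsTable("main.rag.precomputed_answers"))  # lookup target at serving time
```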
Monitor and Optimize Throughput
- Use Databricks System Tables and the Metrics API to monitor response latency and concurrency usage.
- Optimize indexing in Vector Search to reduce retrieval times in RAG pipelines before calling the LLM.
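For monitoring, if you have inference tables (auto capture) enabled on the endpoint, a query along these lines can track latency and throttling over time. The table and column names below are assumptions to verify against your table's actual schema, and `spark`/`display` assume a notebook context:

```python
df = spark.sql("""
    SELECT
        date_trunc('minute', request_time)                 AS minute,
        count(*)                                           AS requests,
        approx_percentile(execution_time_ms, 0.95)         AS p95_latency_ms,
        sum(CASE WHEN status_code = 429 THEN 1 ELSE 0 END) AS throttled_requests
    FROM main.ops.rag_endpoint_inference_log
    GROUP BY 1
    ORDER BY 1 DESC
""")
display(df)
```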
Final Considerations
If you’re integrating Azure OpenAI as part of your LLM pipeline, keep in mind that Databricks Model Serving acts as a proxy, so Azure OpenAI’s streaming latency will also play a role. One potential workaround is to pre-generate responses for high-frequency queries using a combination of the Databricks Feature Store and Vector Search Index, which can deliver results faster.
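One possible shape for that workaround is a semantic lookup against a Vector Search index of pre-generated answers before falling back to the LLM. The endpoint, index, and column names are placeholders, and the response shape may vary by client version:

```python
from databricks.vector_search.client import VectorSearchClient

# Placeholders: a Vector Search index built over a table of pre-generated Q&A pairs.
vsc = VectorSearchClient()
index = vsc.get_index(endpoint_name="vs-endpoint",
                      index_name="main.rag.precomputed_answers_index")


def cached_answer(question: str, min_score: float = 0.9):
    """Return a pre-generated answer for near-duplicate questions, else None."""
    hits = index.similarity_search(query_text=question,
                                   columns=["question", "answer"],
                                   num_results=1)
    rows = hits.get("result", {}).get("data_array", [])  # shape may vary by client version
    if rows and rows[0][-1] >= min_score:                 # last field is the similarity score
        return rows[0][1]                                 # the stored answer column
    return None  # caller falls back to streaming from the LLM endpoint
```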
Hope this helps!
3 weeks ago
I should also mention that the LLM chain calls an Azure OpenAI endpoint as a final step, which then streams the answer back to the client. This means each request remains open for the duration of response generation.
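Roughly, that final step looks like this (the resource URL, API version, deployment name, and prompt are placeholders):

```python
import os

from openai import AzureOpenAI

# Placeholders: values depend on the Azure OpenAI resource and deployment.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

# The HTTP connection stays open until the last token arrives, which is why the
# Model Serving request occupies a concurrency slot for the full 30-45 seconds.
stream = client.chat.completions.create(
    model="<deployment-name>",
    messages=[{"role": "user", "content": "Answer using the retrieved context ..."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```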
2 weeks ago
Great answer! I really appreciate you taking the time to help!
a week ago
Wow, thank you so much for your help.

