Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Model Serving and Streaming

Kjetil
Contributor

Hi everyone,

I have a question regarding the concurrency limitations of streaming responses from an LLM chain via Databricks Model Serving.

When using a streaming response, the request remains open for the duration of the generation process. For example, in a RAG pipeline with streaming enabled, it might take 30-45 seconds to complete a single response. Given that the largest Databricks Model Serving compute tier supports up to 64 concurrent requests, does this mean that streaming significantly limits the overall throughput?

For instance, if each request takes 30-45 seconds, wouldn’t that effectively cap the number of requests the endpoint can handle per minute at a very low number? Or am I misunderstanding how Databricks handles concurrency in this context?
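
To put rough numbers on it, here is a back-of-the-envelope sketch using the figures above (illustrative only, not measured values):

```python
# Rough throughput ceiling implied by the figures above:
# 64 concurrent slots, 30-45 s per streamed response. Illustrative only.
concurrency = 64

for seconds_per_request in (30, 45):
    requests_per_minute = concurrency * 60 / seconds_per_request
    print(f"{seconds_per_request} s/request -> ~{requests_per_minute:.0f} requests/minute")

# 30 s/request -> ~128 requests/minute
# 45 s/request -> ~85 requests/minute
```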

For reference, I’m very happy with the performance of Model Serving for traditional ML models, but I’m specifically evaluating its viability for LLM-based applications.

Would appreciate any insights or best practices on handling concurrency with streaming!

Thanks!

1 ACCEPTED SOLUTION

Mantsama4
Contributor III

Great question! You’re absolutely right that streaming responses can keep requests open for the duration of the generation process, which may introduce concurrency limitations. That said, Databricks Model Serving provides several optimizations and best practices to help maximize throughput and efficiently handle concurrent streaming requests. Here are some suggestions:

  1. Asynchronous Streaming and Request Batching

    • Use asynchronous endpoints to avoid blocking resources for extended periods (see the client-side sketch after this list).

    • Where possible, batch multiple queries into a single request to reduce the number of concurrent sessions.

  2. Dynamic Scaling of Compute Resources

    • Leverage auto-scaling clusters in Databricks to dynamically allocate resources based on traffic demand.

    • If you’re using serverless Model Serving, consider increasing the serving capacity (e.g., provisioning a larger workload size) to handle higher loads.

  3. Optimize Token Generation for Faster Responses

    • Fine-tune parameters like temperature, max tokens, and top-k sampling to balance response quality and generation speed.

    • Use caching mechanisms, such as the Databricks Feature Store, to store and retrieve frequently used responses instead of regenerating them.

  4. Parallelize Requests and Use Multiple Endpoints

    • For high-traffic scenarios, deploy multiple Model Serving endpoints and distribute requests using a load balancer.

    • Partition workloads across multiple LLM instances to enable multi-threaded execution of response streaming.

  5. Hybrid Model Serving with Edge Processing

    • Offload preprocessing tasks to edge devices or client-side applications to reduce latency before invoking the LLM.

    • Precompute frequently requested embeddings or responses using the Databricks Feature Store for retrieval-augmented generation (RAG) workflows.

  6. Monitor and Optimize Throughput

    • Use Databricks System Tables and the Metrics API to monitor response latency and concurrency usage.

    • Optimize indexing in Vector Search to reduce retrieval times in RAG pipelines before calling the LLM.
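
To make points 1 and 4 concrete, below is a minimal client-side sketch (not an official Databricks example). It assumes an OpenAI-compatible chat endpoint that accepts a `stream` flag; the workspace host, endpoint name, and token are placeholders. The idea is that an async client keeps many streamed requests in flight without dedicating a blocked thread to each open connection:

```python
import asyncio
import httpx

# Placeholders: substitute your workspace host, endpoint name, and access token.
ENDPOINT_URL = "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations"
HEADERS = {"Authorization": "Bearer <DATABRICKS_TOKEN>"}


async def stream_one(client: httpx.AsyncClient, question: str) -> str:
    """Open one streaming request and consume chunks as they arrive."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "stream": True,  # assumes the served chain supports streamed responses
    }
    chunks = []
    async with client.stream(
        "POST", ENDPOINT_URL, json=payload, headers=HEADERS, timeout=120
    ) as resp:
        async for line in resp.aiter_lines():
            if line:
                chunks.append(line)  # in practice, parse each SSE/JSON chunk here
    return "\n".join(chunks)


async def main(questions: list[str]) -> None:
    # The event loop keeps all streams in flight at once; no thread sits
    # blocked while a slow generation trickles tokens back.
    async with httpx.AsyncClient() as client:
        answers = await asyncio.gather(*(stream_one(client, q) for q in questions))
    print(f"Completed {len(answers)} streamed responses")


if __name__ == "__main__":
    asyncio.run(main(["What is RAG?"] * 8))
```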

Final Considerations
If you’re integrating Azure OpenAI as part of your LLM pipeline, keep in mind that Databricks Model Serving acts as a proxy, so Azure OpenAI’s streaming latency will also play a role. One potential workaround is to pre-generate responses for high-frequency queries using a combination of the Databricks Feature Store and Vector Search Index, which can deliver results faster.
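
To illustrate the pre-generation idea, here is a minimal lookup-before-generate sketch. The in-memory dict is just a stand-in for whatever store you actually use (a Feature Store table, a Vector Search index, or an external cache), and `generate_streamed_answer` is a hypothetical callable that invokes the served chain:

```python
import hashlib

# Stand-in cache; in practice this could be a Feature Store table,
# a Vector Search lookup, or an external key-value store.
_response_cache: dict[str, str] = {}


def _cache_key(query: str) -> str:
    # Normalize and hash the query; semantic matching via Vector Search
    # is the richer option for near-identical phrasings.
    return hashlib.sha256(query.strip().lower().encode()).hexdigest()


def answer(query: str, generate_streamed_answer) -> str:
    """Return a cached answer when available; otherwise generate and cache it.

    `generate_streamed_answer` is a hypothetical callable that invokes the
    served LLM chain and returns the full response text.
    """
    key = _cache_key(query)
    if key in _response_cache:
        return _response_cache[key]   # fast path: no LLM call, no open stream
    response = generate_streamed_answer(query)
    _response_cache[key] = response   # populate for the next caller
    return response
```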

Hope this helps!

Mantu S

4 REPLIES

Kjetil
Contributor

I should also mention that the LLM chain calls an Azure OpenAI endpoint as a final step, which then streams the answer back to the client. This means each request remains open for the full duration of response generation.
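
For reference, that final step looks roughly like the sketch below (placeholder endpoint, key, and deployment name; the rest of the chain is omitted). The generator yields tokens as Azure OpenAI produces them, which is why the Model Serving request stays open until the stream is exhausted:

```python
from openai import AzureOpenAI  # openai>=1.x

# Placeholder configuration: replace with your Azure OpenAI resource details.
client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<AZURE_OPENAI_KEY>",
    api_version="2024-02-01",
)


def stream_answer(prompt: str):
    """Yield response tokens as Azure OpenAI produces them.

    The serving request stays open for the whole loop, which is why a
    30-45 s generation occupies a concurrency slot for that long.
    """
    stream = client.chat.completions.create(
        model="<deployment-name>",  # Azure deployment name, not the base model
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```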

Kjetil
Contributor

Great answer! I really appreciate you taking the time to help!

LaurenFletcher
New Contributor II

Wow, thank you so much for your help.
