<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Question about response time by Llama 3.3 70B in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120336#M912</link>
    <description>&lt;P&gt;Hey everyone!&lt;/P&gt;&lt;P&gt;I'm new to Databricks and I'm learning about the possibilities offered by Mosaic AI Foundation Model Serving, mostly by following the Azure documentation.&lt;BR /&gt;In my testing, I've created 4 Unity Catalog functions via SQL to help the model, Llama 3.3 70B, retrieve data safely from the tables. With this prompt: &lt;EM&gt;`What are the line item needed for orders that are in urgent need of taking care of ? And returns all of them, so you can call the tools multiple times if needed`&lt;/EM&gt;, which makes calls to two custom functions, I get a response time of 1 minute and 19 seconds, which seems a bit high. Is that a normal response time for this model, or is it because I haven't fine-tuned it yet?&lt;BR /&gt;For my tests, I use the `samples.tpch` schema as a playground.&lt;/P&gt;&lt;P&gt;Thanks in advance, everyone! 😊&lt;/P&gt;</description>
    <pubDate>Tue, 27 May 2025 15:24:06 GMT</pubDate>
    <dc:creator>brahaman</dc:creator>
    <dc:date>2025-05-27T15:24:06Z</dc:date>
    <item>
      <title>Question about response time by Llama 3.3 70B</title>
      <link>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120336#M912</link>
      <description>&lt;P&gt;Hey everyone!&lt;/P&gt;&lt;P&gt;I'm new to Databricks and I'm learning about the possibilities offered by Mosaic AI Foundation Model Serving, mostly by following the Azure documentation.&lt;BR /&gt;In my testing, I've created 4 Unity Catalog functions via SQL to help the model, Llama 3.3 70B, retrieve data safely from the tables. With this prompt: &lt;EM&gt;`What are the line item needed for orders that are in urgent need of taking care of ? And returns all of them, so you can call the tools multiple times if needed`&lt;/EM&gt;, which makes calls to two custom functions, I get a response time of 1 minute and 19 seconds, which seems a bit high. Is that a normal response time for this model, or is it because I haven't fine-tuned it yet?&lt;BR /&gt;For my tests, I use the `samples.tpch` schema as a playground.&lt;/P&gt;&lt;P&gt;Thanks in advance, everyone! 😊&lt;/P&gt;</description>
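      <!-- A minimal sketch of the kind of Unity Catalog SQL function described above.
           The catalog, schema, and function name are illustrative assumptions, not the
           poster's actual objects; it exposes line items for urgent orders in
           samples.tpch as a table function the model can call as a tool:

           CREATE OR REPLACE FUNCTION my_catalog.my_schema.get_urgent_order_lineitems()
           RETURNS TABLE (o_orderkey BIGINT, l_linenumber INT, l_quantity DECIMAL(18, 2))
           COMMENT 'Line items for orders with priority 1-URGENT'
           RETURN
             SELECT o.o_orderkey, l.l_linenumber, l.l_quantity
             FROM samples.tpch.orders o
             JOIN samples.tpch.lineitem l ON l.l_orderkey = o.o_orderkey
             WHERE o.o_orderpriority = '1-URGENT';
      -->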
      <pubDate>Tue, 27 May 2025 15:24:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120336#M912</guid>
      <dc:creator>brahaman</dc:creator>
      <dc:date>2025-05-27T15:24:06Z</dc:date>
    </item>
    <item>
      <title>Re: Question about response time by Llama 3.3 70B</title>
      <link>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120421#M915</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Llama 3.3 normally offers faster inference speeds compared to earlier versions. It provides approximately 40% faster responses and reduced batch processing time&lt;/P&gt;
&lt;P&gt;However, the usual performance for Mosaic AI Model Serving are also influenced by configurations such as throughput bands, the setup for real-time or batch inference, and token usage.&amp;nbsp;&lt;/P&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;While your usage with Unity Catalog functions and custom SQL prompts adds a layer of interaction to the model performance, it's important to check the model serving conditions. If the model hasn't been fine-tuned for the specific use case or if throughput isn't optimized (e.g., low-band provisioned throughput), latency might be increased&lt;/P&gt;</description>
      <pubDate>Wed, 28 May 2025 12:38:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120421#M915</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-05-28T12:38:59Z</dc:date>
    </item>
  </channel>
</rss>

