Llama 3.3 generally offers faster inference than earlier Llama versions, with approximately 40% faster responses and reduced batch-processing time.
However, actual performance on Mosaic AI Model Serving also depends on configuration: the provisioned throughput band, whether you run real-time or batch inference, and token usage.
While your use of Unity Catalog functions and custom SQL prompts adds another layer to model performance, it's also worth checking the serving conditions themselves. If the model hasn't been fine-tuned for your specific use case, or if throughput isn't optimized (e.g., a low provisioned-throughput band), latency can increase.
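Before adjusting the throughput band, it can help to measure per-request latency directly so you have a baseline to compare against. The sketch below is a minimal example; `query_endpoint` is a placeholder for your actual client call (for instance, an HTTP POST to the endpoint's invocations URL), not a real Databricks API.

```python
import time
import statistics

def query_endpoint(prompt: str) -> str:
    # Placeholder: replace with a real call to your serving endpoint,
    # e.g. an HTTP POST to its /invocations URL. Simulated here.
    time.sleep(0.01)  # stand-in for network + inference time
    return f"response to: {prompt}"

def measure_latency(prompts, warmup=1):
    # Warm-up requests avoid counting cold-start time against the endpoint.
    for p in prompts[:warmup]:
        query_endpoint(p)
    latencies = []
    for p in prompts:
        start = time.perf_counter()
        query_endpoint(p)
        latencies.append(time.perf_counter() - start)
    return latencies

if __name__ == "__main__":
    lats = measure_latency(["hello"] * 5)
    print(f"median latency: {statistics.median(lats) * 1000:.1f} ms")
```

Running this before and after a configuration change (or against different throughput bands) gives you a concrete latency comparison rather than relying on the general performance claims.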