<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Question about response time by Llama 3.3 70B in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120336#M912</link>
    <description>&lt;P&gt;Hey everyone!&lt;/P&gt;&lt;P&gt;I'm new to Databricks and I'm learning about the possibilities offered by Mosaic AI Foundation Model Serving, mostly by following the Azure documentation.&lt;BR /&gt;In my testing, I've created 4 Unity Catalog functions via SQL to help the model, Llama 3.3 70B, retrieve data safely from the tables. With this prompt: &lt;EM&gt;`What are the line item needed for orders that are in urgent need of taking care of ? And returns all of them, so you can call the tools multiple times if needed`&lt;/EM&gt;, which makes calls to two custom functions, I get a response time of 1 minute and 19 seconds, which seems a bit high. Is that a normal response time for this model, or is it because I haven't fine-tuned it yet?&lt;BR /&gt;For my tests, I use the `samples.tpch` schema as a playground.&lt;/P&gt;&lt;P&gt;Thanks in advance, everyone! 😊&lt;/P&gt;</description>
    <pubDate>Tue, 27 May 2025 15:24:06 GMT</pubDate>
    <dc:creator>brahaman</dc:creator>
    <dc:date>2025-05-27T15:24:06Z</dc:date>
    <item>
      <title>Question about response time by Llama 3.3 70B</title>
      <link>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120336#M912</link>
      <description>&lt;P&gt;Hey everyone!&lt;/P&gt;&lt;P&gt;I'm new to Databricks and I'm learning about the possibilities offered by Mosaic AI Foundation Model Serving, mostly by following the Azure documentation.&lt;BR /&gt;In my testing, I've created 4 Unity Catalog functions via SQL to help the model, Llama 3.3 70B, retrieve data safely from the tables. With this prompt: &lt;EM&gt;`What are the line item needed for orders that are in urgent need of taking care of ? And returns all of them, so you can call the tools multiple times if needed`&lt;/EM&gt;, which makes calls to two custom functions, I get a response time of 1 minute and 19 seconds, which seems a bit high. Is that a normal response time for this model, or is it because I haven't fine-tuned it yet?&lt;BR /&gt;For my tests, I use the `samples.tpch` schema as a playground.&lt;/P&gt;&lt;P&gt;Thanks in advance, everyone! 😊&lt;/P&gt;</description>
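      <!-- A minimal sketch of the kind of Unity Catalog SQL function described above.
           The catalog, schema, and function name are illustrative assumptions, not the
           poster's actual objects; it exposes line items for urgent orders in
           samples.tpch as a table function the model can call as a tool:

           CREATE OR REPLACE FUNCTION my_catalog.my_schema.get_urgent_order_lineitems()
           RETURNS TABLE (o_orderkey BIGINT, l_linenumber INT, l_quantity DECIMAL(18, 2))
           COMMENT 'Line items for orders with priority 1-URGENT'
           RETURN
             SELECT o.o_orderkey, l.l_linenumber, l.l_quantity
             FROM samples.tpch.orders o
             JOIN samples.tpch.lineitem l ON l.l_orderkey = o.o_orderkey
             WHERE o.o_orderpriority = '1-URGENT';
      -->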
      <pubDate>Tue, 27 May 2025 15:24:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120336#M912</guid>
      <dc:creator>brahaman</dc:creator>
      <dc:date>2025-05-27T15:24:06Z</dc:date>
    </item>
    <item>
      <title>Re: Question about response time by Llama 3.3 70B</title>
      <link>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120421#M915</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;Llama 3.3 normally offers faster inference speeds compared to earlier versions. It provides approximately 40% faster responses and reduced batch processing time&lt;/P&gt;
&lt;P&gt;However, the usual performance for Mosaic AI Model Serving are also influenced by configurations such as throughput bands, the setup for real-time or batch inference, and token usage.&amp;nbsp;&lt;/P&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;While your usage with Unity Catalog functions and custom SQL prompts adds a layer of interaction to the model performance, it's important to check the model serving conditions. If the model hasn't been fine-tuned for the specific use case or if throughput isn't optimized (e.g., low-band provisioned throughput), latency might be increased&lt;/P&gt;</description>
      <pubDate>Wed, 28 May 2025 12:38:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/question-about-response-time-by-llama-3-3-70b/m-p/120421#M915</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-05-28T12:38:59Z</dc:date>
    </item>
  </channel>
</rss>

