Agent Evaluation will alert you to any quality regressions using our proprietary LLM judges and thumbs up 👍 or down 👎 feedback from your users. It will also classify the topics your users discuss with your agent. Feedback can arrive through the review app from stakeholders, or through the feedback API on production endpoints, which lets you capture end-user reactions.
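To make the feedback path concrete, here is a minimal sketch of posting a 👍 reaction from application code. Everything here is an illustrative assumption rather than the documented feedback API contract: the route, the endpoint name, and the payload fields (request_id, text_assessments) are placeholders to adapt to your deployment.

```python
import os
import requests

# All names below are illustrative assumptions: the feedback route, the
# payload fields, and the endpoint name are NOT the documented contract.
host = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "dataframe_records": [
        {
            "request_id": "abc-123",  # id of the logged agent request being rated
            "source": {"id": "user@example.com", "type": "end_user"},
            "text_assessments": [
                # "positive" ~ 👍, "negative" ~ 👎
                {"ratings": {"answer_correct": {"value": "positive"}}}
            ],
        }
    ]
}

resp = requests.post(
    f"{host}/serving-endpoints/my-agent-feedback/invocations",  # assumed route
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
```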
If you identify a quality issue, the MLflow Evaluation UI lets you deep dive into individual requests to identify the root cause so you can fix the issue. Then, you can verify the fix works by re-running the same analysis using Agent Evaluation from a notebook.
The dashboard enables you to slice the metrics by different dimensions, including time, user feedback, pass/fail status, and topic of the input request (for example, to understand whether specific topics are correlated with lower-quality outputs). Additionally, you can drill down into individual requests with low-quality responses to debug them further. All artifacts, such as the dashboard, are fully customizable.
What are the requirements?
The Databricks Assistant must be enabled for your workspace.
Inference tables must be enabled on the endpoint that is serving the agent.
The notebook requires either serverless compute or a cluster running Databricks Runtime 15.2 or above. When continuously monitoring production traffic on endpoints with a large number of requests, we recommend setting a more frequent schedule. For instance, an hourly schedule works well for an endpoint with more than 10,000 requests per hour and a 10% sample rate.
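As a sketch of what such an hourly schedule might look like, the snippet below creates a job over the monitoring notebook using the Databricks Jobs SDK. The notebook path, job name, and the sample_rate parameter are placeholders for your own monitoring setup:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Creates an hourly job over the monitoring notebook. With no cluster
# specified, the task runs on serverless compute.
w.jobs.create(
    name="agent-quality-monitoring",
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 * * * ?",  # top of every hour
        timezone_id="UTC",
    ),
    tasks=[
        jobs.Task(
            task_key="evaluate_traffic",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/monitoring/agent_eval",  # placeholder
                base_parameters={"sample_rate": "0.10"},           # 10% sample
            ),
        )
    ],
)
```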
Mosaic AI Model Serving now supports batch LLM inference using ai_query (public preview)
Especially as you start building complex systems, you need structured outputs, because one stage's output can be fed directly as input to the next stage of the pipeline. You can always tell the LLM what schema you want in the prompt, but that approach is not 100% reliable.
You can now specify a JSON schema to format responses generated from your chat models. The structured outputs feature is OpenAI-compatible (a minimal request sketch follows this list). Databricks recommends using structured outputs for the following scenarios:
Extracting data from large amounts of documents. For example, identifying and classifying product review feedback as negative, positive, or neutral.
Batch inference tasks that require outputs to be in a specified format.
Data processing, like turning unstructured data into structured data.
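To illustrate, here is a minimal structured-outputs request using the OpenAI Python client pointed at a Databricks serving endpoint. The endpoint name, schema, and prompt are illustrative assumptions; the OpenAI-compatible response_format shape is the point:

```python
from openai import OpenAI

# Workspace host, token, and endpoint name below are placeholders.
client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # assumed endpoint name
    messages=[
        {"role": "user", "content": "Classify: 'The blender broke after two days.'"}
    ],
    # JSON schema describing the structure we want back.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "review_sentiment",
            "schema": {
                "type": "object",
                "properties": {
                    "product": {"type": "string"},
                    "sentiment": {
                        "type": "string",
                        "enum": ["negative", "positive", "neutral"],
                    },
                },
                "required": ["product", "sentiment"],
            },
            "strict": True,
        },
    },
)
print(response.choices[0].message.content)  # JSON conforming to the schema
```

Constraining the response with an enum, as above, is what makes the classification output safe to feed into the next pipeline stage without extra parsing or validation.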
Other Platform updates
AI Functions powered by Foundation Model APIs are now available in EU regions: eu-west-1 and eu-central-1.
Build Compound AI Systems Faster with Databricks Mosaic AI:
Summary: Databricks’ Mosaic AI platform now offers an AI Playground integrated with Mosaic AI Agent Evaluation for in-depth agent performance insights, enabling rapid experimentation. Additionally, it provides auto-generated Python notebooks for seamless transition from experimentation to production, facilitating easy deployment of agents with Model Serving. This integration includes automatic authentication to downstream tools and comprehensive logging for real-time monitoring and evaluation.
To ensure production-quality AI systems, Mosaic AI Gateway’s Inference Table captures detailed data on agent interactions, aiding in quality monitoring and debugging. Databricks is also developing a feature that allows foundation model endpoints in Model Serving to integrate enterprise data by selecting and executing tools, enhancing model capabilities. This feature is currently in preview for select customers.
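Since tool selection follows the OpenAI-compatible function-calling format, a request would plausibly look like the sketch below. The tool definition, parameters, and endpoint name are illustrative assumptions, and the feature itself remains in preview:

```python
from openai import OpenAI

client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-host>/serving-endpoints",
)

# Illustrative tool definition in the OpenAI function-calling format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order_status",
            "description": "Fetch the status of a customer order by id.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # assumed endpoint name
    messages=[{"role": "user", "content": "Where is order 42?"}],
    tools=tools,
)
# If the model elects to call the tool, the name and arguments arrive here:
print(response.choices[0].message.tool_calls)
```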
Summary: Databricks' recent analysis evaluates the long-context Retrieval Augmented Generation (RAG) capabilities of OpenAI's o1 models and Google's Gemini 1.5 models. The study reveals that OpenAI's o1-preview and o1-mini models consistently outperform others in long-context RAG tasks up to 128,000 tokens, demonstrating significant improvements over previous models like GPT-4o. In contrast, Google's Gemini 1.5 models, while not matching the top performance of OpenAI's models, maintain consistent RAG performance even at extreme context lengths up to 2 million tokens. This suggests that, for corpora smaller than 2 million tokens, developers might bypass the retrieval step in RAG pipelines by directly inputting the entire dataset into the Gemini models, trading off some performance for a simplified development process.
The study also highlights distinct failure modes in long-context RAG tasks among different models. OpenAI's o1 models occasionally return empty responses when the prompt length exceeds the model's capacity due to intermediate reasoning steps. Google's Gemini models exhibit unique failure patterns, such as refusing to answer questions or indicating that the context is irrelevant, often due to strict API filtering guidelines. These findings underscore the importance of understanding model-specific behaviors and limitations when developing RAG applications, especially as context lengths continue to expand.
Introducing Simple, Fast, and Scalable Batch LLM Inference on Mosaic AI Model Serving
Summary: Databricks has introduced an enhanced batch inference capability within its Mosaic AI Model Serving platform, enabling organizations to efficiently process large volumes of unstructured text data using Large Language Models (LLMs). This advancement allows users to perform batch LLM inference directly on governed data without the need for data movement or preparation. By deploying AI models and executing SQL queries within the Databricks environment, businesses can seamlessly integrate LLMs into their workflows, ensuring full governance through Unity Catalog.
The new batch inference solution addresses common challenges such as complex data handling, fragmented workflows, and performance bottlenecks. It simplifies the process by enabling users to run batch inference directly within their existing workflows, eliminating the need for manual data exports and reducing operational costs. The infrastructure scales automatically to handle large workloads efficiently, with built-in fault tolerance and automatic retries. This approach streamlines the application of LLMs to tasks like information extraction, data transformation, and bulk content generation, providing a scalable and cost-effective solution for processing large datasets.
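As a concrete sketch, batch inference with ai_query can be run from a notebook by wrapping the SQL in spark.sql. The table, column, and endpoint names below are placeholders:

```python
# Assumes a Databricks notebook context where `spark` is predefined.
df = spark.sql("""
    SELECT
      review_text,
      ai_query(
        'databricks-meta-llama-3-1-70b-instruct',  -- serving endpoint
        CONCAT('Summarize this review in one sentence: ', review_text)
      ) AS summary
    FROM main.reviews.raw_reviews
""")

# Persist results back to a governed Unity Catalog table.
df.write.mode("overwrite").saveAsTable("main.reviews.review_summaries")
```

Because the query runs where the data lives, there is no export step, and Unity Catalog governance applies end to end.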