What's New in Databricks
lara_rachidi
Databricks Employee

No time to read? Check out this 10-minute recap video of all the announcements listed below for October 2024 👇

→ Subscribe to our YouTube channel for regular updates

Announcements

A production monitoring dashboard template for Agents/RAGs (public preview)

Walkthrough & Demo: https://youtu.be/1bV-VmIyE6U

 
If you have agents in production and struggle to monitor their quality, cost, and latency, take a look at this latest release in public preview. You can use this dashboard template by importing a notebook from our documentation; the notebook spins up a Lakeview dashboard very quickly. In a nutshell, it simplifies the use of Agent Evaluation’s LLM judges and quality root-cause analysis, and it works with any Agent Framework model serving endpoint or agent logs.

Agent Evaluation will alert you to any quality regressions using our proprietary LLM judges and thumbs up 👍 or down 👎 feedback from your users. It will also classify the topics your users discuss with your agent. Feedback can arrive through the review app (from stakeholders) or through the feedback API on production endpoints, which lets you capture end-user reactions.

If you identify a quality issue, the MLflow Evaluation UI lets you deep dive into individual requests to identify the root cause so you can fix the issue. Then, you can verify the fix works by re-running the same analysis using Agent Evaluation from a notebook.
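
Re-running that analysis from a notebook comes down to a single `mlflow.evaluate` call with `model_type="databricks-agent"`. Here is a minimal sketch, assuming the databricks-agents package is installed; the evaluation rows below are placeholders:

```python
import mlflow
import pandas as pd

# Minimal sketch: re-run Agent Evaluation on a few problematic requests.
# The rows below are placeholders for requests pulled from your logs.
eval_df = pd.DataFrame(
    {
        "request": ["How do I rotate my API token?"],
        "response": ["You can rotate tokens from the admin console."],
        "expected_response": ["Go to Settings > Developer > Access tokens and click Rotate."],
    }
)

results = mlflow.evaluate(
    data=eval_df,
    model_type="databricks-agent",  # invokes Agent Evaluation's LLM judges
)
print(results.metrics)  # aggregate judge metrics; per-row results are in results.tables
```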

The dashboard enables you to slice the metrics by different dimensions, including time, user feedback, pass/fail status, and topic of the input request (for example, to understand whether specific topics are correlated with lower-quality outputs). Additionally, you can drill into individual requests with low-quality responses to debug them further. All artifacts, such as the dashboard, are fully customizable.

What are the requirements?

  • The Databricks Assistant must be enabled for your workspace.
  • Inference tables must be enabled on the endpoint that is serving the Agent.
  • The notebook requires either serverless compute or a cluster running Databricks Runtime 15.2 or above. When continuously monitoring production traffic on endpoints with a large number of requests, we recommend setting a more frequent schedule. For instance, an hourly schedule works well for an endpoint with more than 10,000 requests per hour and a 10% sample rate.
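
The imported monitoring notebook can be put on such a schedule like any other notebook job. Here is a minimal sketch using the Databricks Python SDK; the job name, notebook path, and hourly cron expression are placeholders, and serverless job compute is assumed:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import CronSchedule, NotebookTask, Task

w = WorkspaceClient()

# Schedule the imported monitoring notebook to refresh hourly (placeholder name and path).
job = w.jobs.create(
    name="agent-monitoring-hourly",
    tasks=[
        Task(
            task_key="refresh_agent_monitoring",
            notebook_task=NotebookTask(
                notebook_path="/Workspace/Shared/agent_monitoring_dashboard"
            ),
        )
    ],
    schedule=CronSchedule(
        quartz_cron_expression="0 0 * * * ?",  # top of every hour
        timezone_id="UTC",
    ),
)
print(f"Created monitoring job {job.job_id}")
```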
[Image: Production monitoring dashboard template for Agents]

Mosaic AI Model Serving now supports batch LLM inference using ai_query (public preview)

Walkthrough & Demo: https://youtu.be/B5mRRdWkUuk

  • Databricks recommends using ai_query with Model Serving for batch inference. For quick experimentation, ai_query can be used with pay-per-token endpoints. When you are ready to run batch inference on large or production data, Databricks recommends using provisioned throughput endpoints for faster performance. To get started with batch inference with LLMs on Unity Catalog tables, see the notebook examples in Batch inference using Foundation Model APIs provisioned throughput.
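
At its simplest, batch inference with ai_query is a SQL expression applied column-wise over a Unity Catalog table. A minimal sketch from a Databricks notebook, where the endpoint and table names are placeholders and `spark` is the SparkSession the notebook provides by default:

```python
# Batch-summarize a text column with ai_query (placeholder endpoint and table names).
summaries = spark.sql("""
    SELECT
        review_id,
        ai_query(
            'databricks-meta-llama-3-1-70b-instruct',   -- serving endpoint name
            CONCAT('Summarize this review in one sentence: ', review_text)
        ) AS summary
    FROM main.reviews.customer_reviews
""")

summaries.write.mode("overwrite").saveAsTable("main.reviews.customer_review_summaries")
```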

Structured outputs are now supported on Mosaic AI Model Serving as part of Foundation Model APIs (public preview)

Walkthrough & Demo: https://youtu.be/mIZHRqMoJec

  • When you get output in natural language, it can be messy, right? Even when you give LLMs precise instructions, you don’t always get the exact structure that you want. It’s often important for the output to follow a certain structure so that you can reliably parse it afterwards.
  • Especially when you start building complex systems, you need structured outputs because they can then be fed as inputs to the next step in the pipeline. You can always tell the LLM what schema you want in the prompt, but that’s not 100% reliable.
  • You can now specify a JSON schema to format responses generated from your chat models (see the sketch after this list). The structured outputs feature is OpenAI-compatible. Databricks recommends using structured outputs for the following scenarios:
  • Extracting data from large amounts of documents. For example, identifying and classifying product review feedback as negative, positive or neutral.
  • Batch inference tasks that require outputs to be in a specified format.
  • Data processing, like turning unstructured data into structured data.
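
Because the feature is OpenAI-compatible, you can pass a JSON schema through the standard response_format parameter. A minimal sketch, where the workspace URL, endpoint name, and schema are placeholders and a personal access token is assumed in DATABRICKS_TOKEN:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<your-workspace>.cloud.databricks.com/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # Foundation Model API endpoint
    messages=[{"role": "user", "content": "Classify this review: 'The battery dies in an hour.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "review_classification",
            "schema": {
                "type": "object",
                "properties": {
                    "sentiment": {"type": "string", "enum": ["negative", "neutral", "positive"]},
                    "summary": {"type": "string"},
                },
                "required": ["sentiment", "summary"],
            },
            "strict": True,
        },
    },
)
print(response.choices[0].message.content)  # JSON string conforming to the schema
```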

Other Platform updates

  • AI Functions powered by Foundation Model APIs are now available in EU regions: eu-west-1 and eu-central-1.
  • The Llama 2 70B Chat model is now retired.

Blogposts

Turbocharging GPU Inference at Logically AI

Walkthrough: https://youtu.be/Xdj_-rjEKdE

  • Summary: This blog explores how Logically reduced the runtime of their flagship complex models by up to 40% by tuning concurrent tasks per executor and pushing more tasks per GPU, leveraging fractional GPU allocation, concurrent task execution, and merging existing partitions into a smaller number with coalesce().
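
For orientation only (this is not Logically’s published configuration): fractional GPU allocation is a standard Spark resource-scheduling setting, and coalesce() reduces the partition count so each GPU task processes a larger batch. A rough sketch, with a hypothetical predict_udf standing in for the GPU-backed inference UDF and placeholder table names:

```python
# Cluster Spark config (illustrative values): let 4 concurrent tasks share one GPU.
#   spark.executor.resource.gpu.amount 1
#   spark.executor.cores 4
#   spark.task.resource.gpu.amount 0.25
#
# In the job itself, merge small partitions before applying the inference UDF.
docs = spark.read.table("main.nlp.documents")   # placeholder Unity Catalog table
docs = docs.coalesce(64)                        # fewer, larger partitions per GPU task

# predict_udf: hypothetical pandas UDF that runs GPU inference on a text column
scored = docs.withColumn("prediction", predict_udf("text"))
scored.write.mode("overwrite").saveAsTable("main.nlp.document_predictions")
```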

Build Compound AI Systems Faster with Databricks Mosaic AI

  • Summary: Databricks’ Mosaic AI platform now offers an AI Playground integrated with Mosaic AI Agent Evaluation for in-depth agent performance insights, enabling rapid experimentation. Additionally, it provides auto-generated Python notebooks for a seamless transition from experimentation to production, facilitating easy deployment of agents with Model Serving (see the sketch after this list). This integration includes automatic authentication to downstream tools and comprehensive logging for real-time monitoring and evaluation.
  • To ensure production-quality AI systems, Mosaic AI Gateway’s Inference Table captures detailed data on agent interactions, aiding in quality monitoring and debugging. Databricks is also developing a feature that allows foundation model endpoints in Model Serving to integrate enterprise data by selecting and executing tools, enhancing model capabilities. This feature is currently in preview for select customers.
  • Blog post link
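
As a rough sketch of that log-register-deploy flow: the Unity Catalog model name and agent code path below are placeholders, and the exact arguments can vary across databricks-agents and MLflow versions:

```python
import mlflow
from databricks import agents

mlflow.set_registry_uri("databricks-uc")

# Log the agent's code and register it as a Unity Catalog model (placeholder names).
with mlflow.start_run():
    logged_agent = mlflow.langchain.log_model(
        lc_model="agent.py",                           # driver file for the agent
        artifact_path="agent",
        registered_model_name="main.agents.support_bot",
    )

# Deploy the registered version to Model Serving; this sets up the endpoint,
# the review app, and the inference tables used for monitoring.
agents.deploy(
    model_name="main.agents.support_bot",
    model_version=logged_agent.registered_model_version,
)
```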

The Long Context RAG Capabilities of OpenAI o1 and Google Gemini

  • Blog post link
  • Summary: Databricks' recent analysis evaluates the long-context Retrieval Augmented Generation (RAG) capabilities of OpenAI's o1 models and Google's Gemini 1.5 models. The study reveals that OpenAI's o1-preview and o1-mini models consistently outperform others in long-context RAG tasks up to 128,000 tokens, demonstrating significant improvements over previous models like GPT-4o. In contrast, Google's Gemini 1.5 models, while not matching the top performance of OpenAI's models, maintain consistent RAG performance even at extreme context lengths up to 2 million tokens. This suggests that, for corpora smaller than 2 million tokens, developers might bypass the retrieval step in RAG pipelines by directly inputting the entire dataset into the Gemini models, trading off some performance for a simplified development process.
  • The study also highlights distinct failure modes in long-context RAG tasks among different models. OpenAI's o1 models occasionally return empty responses when the prompt length exceeds the model's capacity due to intermediate reasoning steps. Google's Gemini models exhibit unique failure patterns, such as refusing to answer questions or indicating that the context is irrelevant, often due to strict API filtering guidelines. These findings underscore the importance of understanding model-specific behaviors and limitations when developing RAG applications, especially as context lengths continue to expand.

Introducing Simple, Fast, and Scalable Batch LLM Inference on Mosaic AI Model Serving

  • Blog post link
  • Summary: Databricks has introduced an enhanced batch inference capability within its Mosaic AI Model Serving platform, enabling organizations to efficiently process large volumes of unstructured text data using Large Language Models (LLMs). This advancement allows users to perform batch LLM inference directly on governed data without the need for data movement or preparation. By deploying AI models and executing SQL queries within the Databricks environment, businesses can seamlessly integrate LLMs into their workflows, ensuring full governance through Unity Catalog.
  • The new batch inference solution addresses common challenges such as complex data handling, fragmented workflows, and performance bottlenecks. It simplifies the process by enabling users to run batch inference directly within their existing workflows, eliminating the need for manual data exports and reducing operational costs. The infrastructure scales automatically to handle large workloads efficiently, with built-in fault tolerance and automatic retries. This approach streamlines the application of LLMs to tasks like information extraction, data transformation, and bulk content generation, providing a scalable and cost-effective solution for processing large datasets.