Personal Insights
Here are a few quick notes to give you a sense of the kinds of topics you’ll dive into as part of this exam:
1. Generative AI Solution Development (RAG)
Prompt Engineering Primer
A good prompt generally contains 4 parts:
- Instruction — A clear directive
- Context — Background
- Input — Your specific question
- Output — Your desired structure
- Zero-shot vs. Few-shot Prompting — Zero-shot prompting uses no examples, while few-shot prompting provides a few input-output examples to guide the model (see the sketch after this list)
- Prompt Chaining — Breaks a complex task into manageable steps, with the output of one prompt feeding into the next
- Tradeoffs with prompting — Prompting alone is simple and efficient, but the output is limited by the pre-trained model’s internal knowledge. For external knowledge, RAG is needed.
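To make the four-part prompt structure and few-shot prompting concrete, here is a minimal sketch in plain Python. The task, examples, and wording are made up for illustration; only the structure matters.

```python
# Minimal sketch: a four-part prompt (instruction, context, input, output)
# combined with few-shot examples. Task and examples are illustrative only.

few_shot_examples = """\
Review: "The battery dies within two hours."
Sentiment: negative

Review: "Setup took five minutes and it just works."
Sentiment: positive
"""

prompt = f"""\
Instruction: Classify the sentiment of the customer review.
Context: Reviews come from an online electronics store.
{few_shot_examples}
Input: "The screen is gorgeous, but the speakers crackle."
Output: Respond with a single word: positive, negative, or mixed.
"""

print(prompt)  # send this string to the LLM of your choice
```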
Introduction to RAG
RAG helps overcome prompting limitations by passing contextual information, much like taking an exam with open notes. Chances are you’ll answer better now that you have access to a known, reliable source of data.
The first important part of a RAG implementation is data prep: garbage in = garbage out.
A RAG pipeline includes:
- Ingestion, pre-processing, and data storage & governance
- Chunking — This is use-case specific. Different variants exist, including context-aware and fixed-size chunking; you can use either or a combination. Experiment with different chunk sizes and (basic to advanced) approaches to find the right fit (see the sketch after this list). For instance, windowed summarization is a context-enriching method, where each chunk includes a ‘windowed summary’ of the previous few chunks.
- Embedding — best practice here is to choose the right embedding model for your domain, and to use the same embedding model on both the indexing side and the query side.
- Storing in a Vector Database — a database optimized to store and retrieve high-dimensional vectors such as embeddings. In the Databricks world, there is a 3-step process to set up Vector Search (a rough code sketch follows the steps):
Step 1 — Create a Vector Search Endpoint
Step 2 — Create a Model Serving Endpoint for the embedding model (needed only if you want Databricks to compute the embeddings for you)
Step 3 — Create a Vector Search Index
(Image credit: Databricks)
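As a starting point for chunking experiments, here is a minimal fixed-size chunking sketch with overlap. The chunk size and overlap are arbitrary defaults you would tune per use case.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap.

    chunk_size and overlap are illustrative defaults; tune them per use case.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks


document = "Databricks Vector Search stores embeddings of document chunks. " * 40
print(len(chunk_text(document)), "chunks")
```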
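And for the 3-step setup above, here is a rough sketch using the databricks-vectorsearch client. The endpoint, catalog/table, column, and embedding-endpoint names are placeholders, and exact arguments can differ by client version, so treat this as an outline rather than a copy-paste recipe.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Step 1 - create a Vector Search endpoint (name is a placeholder)
client.create_endpoint(name="demo_vs_endpoint", endpoint_type="STANDARD")

# Step 2 - an embedding Model Serving endpoint (assumed here to already exist
# as "demo_embedding_endpoint"); only needed if Databricks computes embeddings

# Step 3 - create a Delta Sync index over a chunked source Delta table
index = client.create_delta_sync_index(
    endpoint_name="demo_vs_endpoint",
    index_name="main.default.docs_index",
    source_table_name="main.default.docs_chunked",
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk_text",
    embedding_model_endpoint_name="demo_embedding_endpoint",
)

# Query the index with the same embedding model used at indexing time
results = index.similarity_search(
    query_text="How do I reset my password?",
    columns=["chunk_id", "chunk_text"],
    num_results=3,
)
```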
Evaluating a RAG application
When evaluating a RAG application, you have to evaluate the individual components (such as chunking performance, retrieval performance, and generator performance) along with the overall end-to-end solution.
RAG evaluation metrics include context precision, context relevancy, context recall, faithfulness, answer relevance, and answer correctness, and are based on the below 4 entities — Ground Truth, Query, Context, and Response (a simple sketch of the retrieval-side metrics follows the image).
(Image credit: Databricks)
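As a rough illustration of the retrieval-side metrics, here is a simple sketch that scores retrieved chunks against ground-truth chunks. Real evaluation frameworks compute more nuanced, often LLM-graded, versions of these metrics; the chunk IDs here are made up.

```python
def context_precision(retrieved: list[str], relevant: list[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: list[str]) -> float:
    """Fraction of the relevant (ground-truth) chunks that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk in relevant if chunk in retrieved)
    return hits / len(relevant)


retrieved = ["chunk_a", "chunk_b", "chunk_c"]  # what the retriever returned
relevant = ["chunk_a", "chunk_d"]              # ground-truth context
print(context_precision(retrieved, relevant))  # ~0.33
print(context_recall(retrieved, relevant))     # 0.5
```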
2. Generative AI Application Development (Agents)
- Real-world prompts have multiple intents, with each intent having multiple tasks.
- You first identify the intent, and then you implement the intent using chains.
- Frameworks like LangChain help you create Gen AI applications that utilize large language models (see the sketch below).
(Image credit: Databricks)
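A minimal chain might look like the sketch below. ChatOpenAI and the model name are stand-ins for whichever chat model you actually serve (on Databricks this could equally be a Databricks-hosted model); the ticket text is invented.

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI  # stand-in; any chat model class works

# Prompt -> model -> parser, composed with LangChain's expression syntax
prompt = ChatPromptTemplate.from_template(
    "Summarize the following support ticket in one sentence:\n\n{ticket}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is illustrative
chain = prompt | llm | StrOutputParser()

print(chain.invoke({"ticket": "My laptop won't turn on after the latest update."}))
```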
Agents
- An agent is an application that executes complex tasks by using a language model to define a sequence of actions to take
- The 4 design (agentic reasoning) patterns are ReAct, tool use, planning (single, sequential, graph task), and multi-agent collaboration (a simplified ReAct sketch follows)
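To give a feel for the ReAct pattern, here is a heavily simplified, framework-free loop. The `call_llm` function and the `search_docs` tool are hypothetical stand-ins; a production agent would use a framework and robust output parsing rather than this hand-rolled version.

```python
# Simplified ReAct-style loop: the model alternates between choosing an action
# (a tool call) and giving a final answer. call_llm and search_docs are stubs.

def search_docs(query: str) -> str:
    return "Vector Search endpoints are created via the Databricks UI or SDK."

TOOLS = {"search_docs": search_docs}

def call_llm(transcript: str) -> str:
    # Placeholder: a real agent calls a chat model with a ReAct-style prompt.
    if "Observation:" not in transcript:
        return "Action: search_docs[vector search setup]"
    return "Final Answer: Create an endpoint, then create an index over your table."

def run_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)
        if reply.startswith("Final Answer:"):
            return reply.removeprefix("Final Answer:").strip()
        # Expected format: "Action: <tool_name>[<tool input>]"
        tool_name, _, tool_input = reply.removeprefix("Action:").strip().partition("[")
        observation = TOOLS[tool_name](tool_input.rstrip("]"))
        transcript += f"{reply}\nObservation: {observation}\n"
    return "No answer within the step budget."

print(run_agent("How do I set up vector search?"))
```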
Building Agentic Systems
To translate a business use case into an AI pipeline: Identify business goals → determine required data inputs → define expected outputs → map these to model tasks and chain components.
Pay-per-token vs Provisioned throughput
- Go with pay-per-token for low throughput and provisioned throughput for high throughput. At low usage, you only need occasional access, so pay-per-token keeps costs low by charging only for what you use.
- At high usage, pay-per-token becomes more expensive than reserving dedicated capacity, so provisioned throughput gives you a discounted, predictable rate for heavy, consistent workloads (a rough break-even sketch follows).
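A rough break-even calculation, with entirely hypothetical prices, shows how the decision works: below the break-even token volume pay-per-token is cheaper, above it provisioned throughput wins.

```python
# All prices are hypothetical; plug in current rates for your model and region.
ppt_price_per_million_tokens = 2.00   # USD, pay-per-token
provisioned_cost_per_hour = 10.00     # USD, reserved capacity
hours_per_month = 730

provisioned_monthly = provisioned_cost_per_hour * hours_per_month
break_even_tokens = provisioned_monthly / ppt_price_per_million_tokens  # in millions

print(f"Provisioned throughput costs ~${provisioned_monthly:,.0f}/month")
print(f"Break-even at ~{break_even_tokens:,.0f}M tokens/month")
# Below that monthly volume, pay-per-token is cheaper; above it, provisioned wins.
```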
3. Generative AI Application Evaluation and Governance
To evaluate these complex AI systems, you will need to evaluate their components. The Databricks AI Security Framework (DASF) was developed to demystify AI security and is based on 12 AI system components and 55 associated risks.
Two options to evaluate:
- If you have a ground-truth dataset, go with benchmarking, where you compare models against standard evaluation datasets
- If you don’t have ground truth, define your own custom metric and go with LLM-as-a-judge. Some best practices for LLM-as-a-judge (illustrated in the sketch after this list):
— Use small rubric scales
— Provide a wide variety of examples
— Use an LLM with a large context window — more tokens means more context
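As an illustration of those practices, here is a sketch of a judge prompt with a small 1-to-3 rubric and a couple of examples. The rubric wording, examples, and graded answer are all made up.

```python
# Sketch of an LLM-as-a-judge prompt: small rubric scale plus a few examples.
# Rubric wording and examples are illustrative, not an official template.

JUDGE_PROMPT_TEMPLATE = """\
You are grading the faithfulness of an answer to the provided context.
Use this 3-point rubric:
  1 = answer contradicts or ignores the context
  2 = answer is partially supported by the context
  3 = answer is fully supported by the context

Example:
Context: "Returns are accepted within 30 days."
Answer: "You can return items within 90 days."
Score: 1

Example:
Context: "The endpoint supports batch and streaming ingestion."
Answer: "It supports batch ingestion."
Score: 2

Now grade the following. Respond with only the score.
Context: "{context}"
Answer: "{answer}"
Score:"""

prompt = JUDGE_PROMPT_TEMPLATE.format(
    context="Vector Search indexes sync from a Delta table.",
    answer="Indexes are populated from a Delta table.",
)
print(prompt)  # send to a judge LLM with a large context window
```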
4. Generative AI Application Deployment and Monitoring
Offline vs Online Evaluation
Offline evaluation is everything that happens before launching the system in prod, whereas online evaluation is everything that happens after launching the system in prod.
Evaluation vs Monitoring
In the Gen AI system lifecycle, after building your AI system, you evaluate it -> deploy it -> and then start monitoring it.
- Evaluation: Before deployment, test models on benchmarks and datasets.
- Monitoring: After deployment, track real-world usage, drift, and performance metrics.
(Image credit: Databricks)
Deployment Methods
Different use cases call for different types of deployment methods, such as batch, streaming, real-time, and edge/embedded. Each of these methods comes with its tradeoffs.
Recommended LLMOps Architecture
Similar to traditional software development, it is recommended to have 3 separate environments (development, staging, and production), as depicted in the picture below.
(Image credit: Databricks)