    Introduction

    Databricks and DSPy can help overcome common challenges of creating Compound AI systems, including ones tasked with writing blog articles like the one you are reading right now. The examples in this article use Databricks Foundation Models, DSPy, and MLflow to build and deploy a blog-writing AI system, and as we’ll see, DSPy makes model selection less important by decomposing an AI-driven task into abstract pieces. For the end-to-end code of this initial version, clone the “Blog Post Generator” notebook in this repo.

    Writing technical articles requires spending time organizing ideas about solutions to customer challenges into a useful, well-designed structure. Plugging observations into an LLM can be a helpful starting point, but oftentimes the output isn’t structured in a consistent or practical way. Furthermore, the text isn’t oriented towards solving a customer problem, or it simply isn’t engaging. Raw models repeat vaguely useful generic information rather than using paragraphs that follow a coherent structure to illustrate and support the central topic, and they’re often verbose when framing that topic. 

    To overcome these sorts of issues it’s common to use prompt engineering tactics with a generic model. For one type of output the trick might be to beg the model to follow instructions, and for another it might be to coerce the model with threats. For certain problem spaces you might need to tell the model to take a breath and think through it, and for others you might provide few-shot examples to guide the response. 

    DSPy eliminates brittle ties to specific models and helps avoid the issue of careful prompt engineering becoming obsolete with the latest model release. We'll use the problem of writing the outline and draft of a Databricks blog post as an example. The example system does provide some value already - we've connected it to our team's internal blog post idea board and the system did, in fact, provide a useful starting point! Furthermore, we conclude with some goals for improvement in future versions which could enhance the usefulness of the system, and would be relatively easy to implement thanks to the capabilities of DSPy and Databricks.

    DSPy Setup and Modules

    To get started, we'll install the DSPy library, set up the DBRX foundation model to be used by DSPy by default, and define llama3-70b as a teacher model that we’ll use later on. This setup is the first step towards optimizing the end-to-end compound system with no manual prompting or evaluation required. No output is expected from the code snippet below.

     

    %pip install dspy-ai mlflow --upgrade -q
    dbutils.library.restartPython()
    
    import dspy
    
    # Retrieve the access token and url for our model serving endpoints
    token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
    url = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiUrl().get() + '/serving-endpoints'
    
    # Set the models to be used in our examples
    lm = dspy.Databricks(model='databricks-dbrx-instruct', model_type='chat', api_key=token, api_base=url, max_tokens=1000)
    teacher = dspy.Databricks(model='databricks-meta-llama-3-70b-instruct', model_type='chat', api_key=token, api_base=url, max_tokens=1000)
    dspy.settings.configure(lm=lm)

     

    Next we’ll create Modules, which are the building blocks of DSPy pipelines. Modules use Signatures to define the expectations for the shape of our inputs and outputs, and provide the developer various strategies to produce the desired output. 

    When I write a blog post, I typically start with a collection of thoughts and try to organize those thoughts into a more structured outline. Then, for each section of my outline, I write a paragraph and check the documentation for guidance on details when required. 

    We can instruct DSPy to do the same by defining two modules, one for writing an outline and the other for writing paragraphs. Both of these simple modules can use DSPy’s ChainOfThought to elicit deeper thinking from the model given those inputs. Defining a Signature and Module for our SectionToParagraph operation uses Python classes that will feel familiar to PyTorch users, as shown below. We’ve also passed in a retriever model (docs_rm) which leverages a Databricks Vector Search Index containing the Databricks documentation, allowing our model to access information not included in its training data. The final section of this code runs a test input through the class and produces a paragraph for the given section and topic. 

    Note that alongside the paragraph, the ChainOfThought module produces a rationale explaining the thought process that produced the paragraph. Instead of testing prompts like “think step by step” vs “take a breath and think through the result”, we can let DSPy take the wheel and focus on our higher level abstractions.
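
    The Module below expects a retriever to be passed in as docs_rm. As a minimal sketch (the index name, endpoint, and column names here are placeholders, and the actual setup lives in the repo), such a retriever could be constructed with DSPy’s DatabricksRM over a Vector Search index of the Databricks documentation:

 

    from dspy.retrieve.databricks_rm import DatabricksRM
    
    # Hypothetical Vector Search index over the Databricks documentation - substitute your own
    # index, endpoint, and column names; the repo's end-to-end example shows the full setup
    docs_rm = DatabricksRM(
        databricks_index_name="catalog.schema.databricks_docs_index",  # placeholder index name
        databricks_endpoint=url.replace("/serving-endpoints", ""),     # workspace URL from the setup cell
        databricks_token=token,
        columns=["id", "text"],        # assumed id and text columns in the index
        docs_id_column_name="id",
        text_column_name="text",
        k=3,                           # number of documentation chunks to retrieve per query
    )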

     

    class SectionToParagraphSig(dspy.Signature):
       """Convert one section of an outline to a paragraph"""
       section = dspy.InputField(desc="A short section of an outline describing some supporting idea for the intended topic")
       topic = dspy.InputField(desc="The overall topic for a technical blog post")
       context = dspy.InputField(desc="Context related to the section to help with detailed writing")
       paragraph = dspy.OutputField(desc="A paragraph providing a detailed explanation of some supporting idea of the topic and why it's relevant")
    
    class SectionToParagraph(dspy.Module):
       def __init__(self, docs_rm):
           super().__init__()
           self.docs_rm = docs_rm
           self.prog = dspy.ChainOfThought(SectionToParagraphSig)
    
       def forward(self, section, topic):
           context_list = self.docs_rm(query=section, query_type="text").docs
           context = "\n".join(context_list)
           return self.prog(section=section, topic=topic, context=context)
    
    test_section = "a. Introduction to the approach of making certain applyInPandas operations faster b. Explanation of generating dummy data for the example using Spark c. Code for generating the initial dataframe with specified number of rows, devices, and trips"
    test_topic = "Optimizing the performance of applyInPandas operations in Spark by combining distributed processing with in-memory processing using custom aggregators."
    unoptimized_paragrapher = SectionToParagraph(docs_rm)
    pred = unoptimized_paragrapher(test_section, test_topic)
    print(pred)
    # prints dspy.Prediction(rationale='...', paragraph='...')

     

    Module Evaluation and Optimization

    Our next step is to create a golden dataset to use for training and evaluation of our system. In this example, we can start with a simple copy/paste from some of our favorite Databricks Community technical blogs. A short script (found in the helpers folder of the repo) can be used to extract reasonable topics, outlines, and paragraphs from the original blogs. After some manual proofreading, we can upload a CSV with our data to the workspace and make it available for use. The workspace makes it simple to share the data from a git folder, although in production it’d be more effective and secure to use a Volume to store this data. We can also scale the curation of our data more effectively with the Databricks Mosaic AI Agent Framework, which has tooling to enable our SMEs to label data and correct the responses from our system. In any case, we use a list of DSPy Examples to store the golden dataset which we’ll use to optimize our system later.

     

    import pandas as pd
    
    paragraphs_golden_dataset = pd.read_csv('./artifacts/blog_drafter/sections_and_paragraphs.csv')
    
    paragraph_train_cutoff = int(len(paragraphs_golden_dataset) * .6)
    
    paragraph_dataset = [dspy.Example(section=row['Section'], topic=row['Topic'], paragraph=row['Paragraph']).with_inputs('section', 'topic')
                         for i, row in paragraphs_golden_dataset.iterrows()]
    
    paragraph_trainset = paragraph_dataset[:paragraph_train_cutoff]
    paragraph_testset = paragraph_dataset[paragraph_train_cutoff:]

     

    To set up our optimization, we need to create a measurement of how well our system is performing so we can have a metric to optimize for. We could add simple calculations such as the length of the paragraph. However, we’ll define a more subjective measure to gauge how well we’re mitigating some of the challenges posed in “vanilla” prompt engineering that were mentioned in the introduction. We’ll do this by using an Assess signature, which will accept a piece of text and a question about that piece of text, then use a language model to ascertain a yes or no answer to that question. We’ll use this in our custom metric to score whether the produced text is engaging, appropriately structured, and solves a specific problem. Below, we evaluate the raw model’s outputs against our metric as a baseline.

     

    class Assess(dspy.Signature):
       """Assess the quality of a piece of text along the specified dimension."""
       text_to_assess = dspy.InputField(desc="Piece of text to assess")
       assessment_question = dspy.InputField(desc="Question to answer about the text")
       assessment_answer = dspy.OutputField(desc="Yes or No")
    # Note: as we gather more data we could optimize this module too!
    assessor = dspy.ChainOfThought(Assess)
    
    from dspy.evaluate import Evaluate
    
    def paragraph_metric(gold, pred, trace=None):
       gold_paragraph, topic = gold.paragraph, gold.topic
       paragraph = pred.paragraph
       clarity = f"Is the given paragraph clear, concise, and does it have continuity similar to the following paragraph? \n Paragraph: \n {gold_paragraph}"
       support = "Does the paragraph clearly articulate a supporting point about how Databricks or data more generally solves some problem?"
       example = "Is the paragraph either an introduction, conclusion, or it is a supporting paragraph with a code example to illustrate its point?"
       detailed = f"Does the paragraph provide excellent detail about the overall point, rather than being generic or repetitive, similar to the following paragraph? \n Paragraph: \n {gold_paragraph}"
       aligned = f"Is the paragraph aligned to the following topic? \n Topic: {topic}?"
       with dspy.context(lm=teacher): # Use the teacher to grade the student model
           evals =  [assessor(text_to_assess=paragraph, assessment_question=question)
                     for question in [clarity, support, example, detailed, aligned]]
       score = sum(['yes' in e.assessment_answer.lower() for e in evals])
       return score / len(evals)
    
    # Evaluate the baseline model using our custom metric
    evaluate = Evaluate(devset=paragraph_testset, metric=paragraph_metric, num_threads=4, display_progress=False, display_table=0)
    paragraph_baseline_results = evaluate(SectionToParagraph(docs_rm))
    print(paragraph_baseline_results)

     

    Now that we’ve set our baseline, we can finally tune the system using DSPy Optimizers. Since we don’t have an extensive dataset, we’ll use the BootstrapFewShot optimizer to generate extra examples for the system to use in few-shot prompts to maximize our metric. The optimizer will leverage our golden dataset examples as a guide for generating the few-shot prompt that is most aligned to the qualitative questions in our metric, as determined by the teacher model. 

    You might notice the BootstrapFewShot class comes from the “teleprompt” portion of the DSPy library - optimizers were originally called teleprompters since their prompting is analogous to an LLM reading from a teleprompter. 

    We also leverage the teacher model, llama3-70b, to generate the bootstrapped examples. This approach can be especially effective when training a smaller model to be as productive as a larger teacher model. In any case, once we’ve optimized our Module we can compare to the baseline. Results can vary widely due to the non-deterministic nature of LLMs and small sample size, but after several trials mine centered around a 17% improvement (sometimes slightly negative, occasionally hugely positive).

     

    from dspy.teleprompt import BootstrapFewShot
    
    
    # Set up the optimizer: we want to "bootstrap" (i.e., self-generate) up to 1 demo and include up to 3 labeled demos for our CoT program.
    config = dict(max_bootstrapped_demos=1, max_labeled_demos=3, teacher_settings=dict({'lm': teacher}))
    
    # Optimize! The metric is going to tell the optimizer how well it's doing according to our qualitative statements
    optimizer = BootstrapFewShot(metric=paragraph_metric, **config)
    optimized_paragrapher = optimizer.compile(SectionToParagraph(docs_rm), trainset=paragraph_trainset)
    
    evaluate = Evaluate(devset=paragraph_testset, metric=paragraph_metric, num_threads=4, display_progress=False, display_table=0)
    optimized_paragraph_results = evaluate(optimized_paragrapher)
    print(optimized_paragraph_results)
    
    improvement = (optimized_paragraph_results / paragraph_baseline_results) - 1
    print(f"% improvement: {improvement * 100}")
    # Results may vary widely - mine are around 17% improvement

 


    Model Registry and Deployment

    Finally, we can use the model to generate an outline and blog post draft. This snippet assumes each of the above steps was also carried out for a separate AbstractToOutline Module (sketched below, and illustrated in full in the repo’s end-to-end example). We parse the generated outline into sections and write one paragraph for each. All of these steps are wrapped in an MLflow pyfunc class. We’ll initialize the class and pass it our original test abstract to validate the results.
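
    As a reference point, here is a minimal sketch of what that AbstractToOutline Module might look like. The field names mirror how the outliner is used in draft_blog below (an outline and topic output for a given abstract), but the descriptions are placeholders and the optimized version in the repo is the source of truth.

 

    class AbstractToOutlineSig(dspy.Signature):
       """Convert a rough abstract into a numbered outline and a central topic"""
       abstract = dspy.InputField(desc="A rough collection of ideas for a technical blog post")
       outline = dspy.OutputField(desc="A numbered outline of short sections supporting the topic")
       topic = dspy.OutputField(desc="A concise statement of the post's central topic")
    
    class AbstractToOutline(dspy.Module):
       def __init__(self):
           super().__init__()
           self.prog = dspy.ChainOfThought(AbstractToOutlineSig)
    
       def forward(self, abstract):
           return self.prog(abstract=abstract)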

     

    import mlflow
    import os
    
    
    os.environ['token'] = token
    os.environ['url'] = url
    class DSPyWrapper(mlflow.pyfunc.PythonModel):
       def __init__(self, outliner, paragrapher):
           self.outliner = outliner
           self.paragrapher = paragrapher
    
       def load_context(self, context):
           self.dspy_setup()
    
       def dspy_setup(self):
           import dspy
           import os
           url = os.environ['url']
           token = os.environ['token']
           lm = dspy.Databricks(model='databricks-dbrx-instruct', model_type='chat', api_key=token, api_base=url, max_tokens=1000)
           dspy.settings.configure(lm=lm)
    
       def parse_outline(self, outline):
           import re
           # Split the outline on numbered markers like "1." or "2a." into individual sections
           output = re.split(r'\d+[a-zA-Z]?\.', outline)
           return [line.strip() for line in output if line.strip()]
    
       def draft_blog(self, row):
           outline_pred = self.outliner(row['abstract'])
           outline, topic = outline_pred.outline, outline_pred.topic
           outline_sections = self.parse_outline(outline)
           paragraphs = [self.paragrapher(section=section, topic=topic).paragraph
                         for section in outline_sections
                         if len(section.strip()) > 5]
           return pd.Series([outline, topic, paragraphs])
    
       def predict(self, context, input_df):
           output = input_df.apply(self.draft_blog, axis=1, result_type='expand')
           output.columns = ['outline', 'topic', 'paragraphs']
           return output
    
    mlflow_dspy_model = DSPyWrapper(optimized_outliner, optimized_paragrapher)
    input_data = pd.DataFrame({'abstract': [test_abstract]})
    pred = mlflow_dspy_model.predict(None, input_data)
    display(pred)
    # outline | topic | paragraphs
    # …       | …     | …

     

    In our final step we’ll log the MLflow model and our calculated metrics, register the model to Unity Catalog, and validate that downstream consumers can use the model in their pipelines. The output should match the previous step. Unity Catalog makes it simple to share the model across teams or workspaces. This model could also be deployed to a model serving endpoint for real-time REST API calls (a sketch of querying such an endpoint follows the code below). Databricks model serving makes it simple to connect third-party applications, such as a team’s Jira board, so the model can automatically add blog post drafts to tickets. 

     

    from mlflow.models import infer_signature
    import pkg_resources
    
    
    dspy_version = pkg_resources.get_distribution("dspy-ai").version
    signature = infer_signature(input_data, pred)
    
    with mlflow.start_run() as run:
       mlflow.pyfunc.log_model(artifact_path="model", python_model=DSPyWrapper(optimized_outliner, optimized_paragrapher), signature=signature,
                               input_example=input_data, extra_pip_requirements=[f"dspy-ai=={dspy_version}"])
       mlflow.log_metric("outline_metric", optimized_outline_results)
       mlflow.log_metric("paragraph_metric", optimized_paragraph_results)
    
    mlflow.set_registry_uri("databricks-uc")
    model_path = "josh_melton.blogs.blog_post_drafter"
    latest_model = mlflow.register_model(f"runs:/{run.info.run_id}/model", name=model_path)
    client = mlflow.client.MlflowClient()
    client.set_registered_model_alias(name=model_path, alias="Production", version=latest_model.version)
    
    model_path = "josh_melton.blogs.blog_post_drafter"
    registered_model_uri = f"models:/{model_path}@Production"
    model = mlflow.pyfunc.load_model(registered_model_uri)
    display(model.predict(input_data))
    # outline | topic | paragraphs
    # …       | …     | …
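
    As a sketch of the real-time path mentioned above: once this registered model is deployed to a Databricks model serving endpoint (the endpoint name below is a placeholder), downstream applications could request a draft over REST. The payload shape follows the signature we inferred above.

 

    import requests
    
    # Hypothetical endpoint name - create the serving endpoint from the Unity Catalog model first
    endpoint_name = "blog_post_drafter"
    response = requests.post(
        f"{url}/{endpoint_name}/invocations",  # `url` from the setup cell already ends in /serving-endpoints
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        json={"dataframe_records": [{"abstract": test_abstract}]},
    )
    print(response.json())  # expect outline, topic, and paragraphs in the response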

     


    Conclusion

    In this article, we’ve learned how to leverage DSPy to “program, not prompt” your language models. We’ve created a pipeline that leverages golden examples to optimize our prompts to tailor the outputs to a custom metric. In addition to removing the manual and error prone prompt engineering from our LLM development, we’ve improved the score on our metric which is representative of the specific qualitative aspects of the output that we’re looking for. 

    In the short term, my team has integrated this basic model into our internal kanban board to automatically generate outlines and drafts of our blog post ideas. This will allow us to collect more data and curate higher quality examples using the Databricks Mosaic AI Agent Framework. In the future, we could also add separate modules for introductions and conclusions, or try different hyperparameters, optimizers, and metric definitions for the existing DSPy system. Unfortunately, the outline generated for this particular blog post was limited by the LLM’s knowledge of DSPy and the small dataset used for optimization. 

    Future work could try to replicate these results with a smaller fine-tuned model to reduce cost at the same performance, or integrate tool use, such as web search, to extend the system’s capabilities. We could also improve our retrieval using reranking or query expansion as described in our GenAI Cookbook. Finally, we could use more DSPy capabilities such as ProgramOfThought to iterate over code examples, automate the collection of more blog post data, and use more sophisticated optimizers such as MIPRO. Subscribe to our technical blog for all of these updates!