How do you classify billions of free-text entries when there's no “golden dataset” to train on? We can use LLMs for their impressive zero-shot capabilities, but they’re resource-intensive and slow for real-time inference. In this article, we bootstrap a faster supervised model on LLM-generated classifications and leverage vector similarity search as a fallback or enhancement. For the full code example, try this repo.
The opportunity for enterprises to capitalize on vast collections of unstructured text, such as warranty claims, maintenance logs, and incident reports, is hamstrung by the lack of labels needed to power traditional supervised machine learning models. To bridge this gap, businesses have turned to manual data-labeling services, but for large or business-critical datasets this is rarely feasible. Instead, businesses can use an LLM from within their data platform so the data stays in-house. While it's easy to throw a mass of data at an LLM using something like AI_QUERY for batch inference, this may not be feasible or cost-effective for datasets that span billions of rows. This article presents a robust, scalable pipeline for classifying large-scale unstructured data using Databricks and LLMs, even when labeled data is sparse or nonexistent.
Clone the repo into your Databricks workspace, or walk through the code on your own, as we outline three approaches to classification:

1. Batch LLM labeling with AI_QUERY to bootstrap labels from scratch
2. Distilling those LLM labels into a faster, cheaper supervised Spark ML model
3. Vector similarity search with Mosaic AI Vector Search as a fallback or sanity check
After generating some sample pilot-note text for our example, we use AI_QUERY() with the databricks-meta-llama-3-3-70b-instruct foundation model to automatically label pilot notes with one of several predefined flight phases. LLM labeling often comes with frustrating extra narration, such as “Certainly! The label is: LABEL”, but by providing a specific schema we constrain the outputs to exactly what we require: a category chosen only from the allowed set (e.g., TAKEOFF, LANDING) and a confidence_level score between 0 and 1. In this approach, we also add more deterministic confidence metrics for the predicted category given the source text, such as semantic similarity and Levenshtein distance. These metrics provide confidence in some labels, highlight examples to investigate in others, and help determine which labels will make the best training dataset for distilling future models. Note that AI_QUERY scales to massive batch inference with LLMs, but at some point models with many billions of parameters are not the most efficient approach to classification. For maximal scale or minimal latency, we'll use the labels generated in approach 1 to train a smaller, faster, and cheaper model in the next approach.
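The sketch below illustrates this step. The repo constrains the LLM output with a response schema; here we approximate that by requesting JSON in the prompt and parsing it, since structured-output arguments vary by runtime. The table name pilot_notes, its column note_text, and the category list are hypothetical placeholders, not names from the repo.

```python
# Approach 1 (sketch): bootstrap labels with AI_QUERY, then add deterministic
# confidence signals. Assumes a Delta table `pilot_notes` with a `note_text` column.
labeled = spark.sql("""
  SELECT
    note_text,
    ai_query(
      'databricks-meta-llama-3-3-70b-instruct',
      CONCAT(
        'Classify this pilot note into exactly one of: ',
        'TAKEOFF, CRUISE, LANDING, TAXI, MAINTENANCE. ',
        'Respond only with JSON of the form {"label": <category>, "confidence_level": <0-1>}. Note: ',
        note_text
      )
    ) AS raw_response
  FROM pilot_notes
""")
labeled.createOrReplaceTempView("labeled_raw")

# Parse the JSON response (returns nulls if the model adds extra narration) and
# compute cheaper confidence metrics: ai_similarity() for semantic similarity
# and the built-in levenshtein() for lexical distance.
scored = spark.sql("""
  SELECT
    note_text,
    parsed.label,
    parsed.confidence_level,
    ai_similarity(note_text, parsed.label)              AS semantic_similarity,
    levenshtein(lower(note_text), lower(parsed.label))  AS levenshtein_distance
  FROM (
    SELECT note_text,
           from_json(raw_response, 'label STRING, confidence_level DOUBLE') AS parsed
    FROM labeled_raw
  ) AS t
""")
scored.write.mode("overwrite").saveAsTable("pilot_notes_llm_labeled")
```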
With a robust LLM-labeled dataset in hand, we train several supervised models (XGBoost, Logistic Regression, and Naive Bayes) using Spark ML and compare their results to select the best one. We follow standard NLP preprocessing steps, such as tokenization and TF-IDF, to generate the final features, and we track metrics such as F1 score and accuracy by logging our experimentation to MLflow. “Accuracy” in this example measures how well our supervised model replicates the results of the LLM from approach 1 rather than the “ground truth” (which we don’t have). By distilling the larger model’s knowledge into a smaller model, we unlock cheap, explainable inference with higher throughput than an LLM can provide, and Spark ML’s native parallelism lets these models run at scale. Another benefit of a traditional ML model is that its predictions come with class probabilities that double as confidence scores, giving us more insight into data points worth investigating: supervised predictions with low confidence are less likely to match the original LLM labels. Discrepancies between the supervised prediction, the unsupervised LLM prediction, and the confidence metrics (LLM confidence, semantic similarity, Levenshtein distance, and supervised prediction confidence) are important data points to investigate and potentially label manually.
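Here is a minimal sketch of the distillation step, assuming the pilot_notes_llm_labeled table from the previous block. It shows one of the candidate models (logistic regression) with a TF-IDF pipeline and MLflow logging; the table, column, and run names are illustrative.

```python
# Approach 2 (sketch): distill the LLM labels into a Spark ML model.
import mlflow
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

df = spark.table("pilot_notes_llm_labeled").select("note_text", "label")
train, test = df.randomSplit([0.8, 0.2], seed=42)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="label", outputCol="target"),           # LLM label -> numeric target
    Tokenizer(inputCol="note_text", outputCol="tokens"),           # basic tokenization
    HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 14),
    IDF(inputCol="tf", outputCol="features"),                      # TF-IDF features
    LogisticRegression(labelCol="target", featuresCol="features"),
])

with mlflow.start_run(run_name="tfidf_logreg"):
    model = pipeline.fit(train)
    preds = model.transform(test)   # adds `prediction` and a `probability` vector (our confidence signal)
    evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction")
    mlflow.log_metrics({
        "f1": evaluator.setMetricName("f1").evaluate(preds),
        "accuracy_vs_llm_labels": evaluator.setMetricName("accuracy").evaluate(preds),
    })
```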
We implement similarity search using Mosaic AI Vector Search as an alternative approach to classification. This approach provides another layer of explainability and can act as a sanity check on the supervised model, a primary classifier in high-variance domains, or a fallback for low-confidence predictions. It also enables semantic search across your corpus of documents, which is useful for other real-time use cases such as RAG. To implement the approach, we embed the source text into vectors using the databricks-gte-large-en foundation model. Next, we create a SQL function in Unity Catalog that searches the index for the most similar texts and returns the most common category among the top 10 neighbors. By identifying the categories of similar pieces of text from the past, we can infer the category of the new text. If similar documents have different categories than those provided by approaches 1 and 2, we have less confidence in the predictions and take steps to address the uncertainty.
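The repo wraps this logic in a Unity Catalog SQL function; the sketch below shows the same majority-vote idea with the Vector Search Python SDK instead. The endpoint and index names are assumptions, and the shape of the search response may differ slightly across SDK versions.

```python
# Approach 3 (sketch): classify a new note by majority vote over its nearest neighbors.
from collections import Counter
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="pilot_notes_endpoint",          # assumed Vector Search endpoint
    index_name="main.default.pilot_notes_index",   # assumed Delta Sync index with managed embeddings
)

def classify_by_neighbors(text: str, k: int = 10) -> str:
    """Return the most common label among the k most similar historical notes."""
    results = index.similarity_search(
        query_text=text,                 # embedded server-side (e.g., databricks-gte-large-en)
        columns=["note_text", "label"],
        num_results=k,
    )
    # Each row contains the requested columns in order, followed by the score.
    labels = [row[1] for row in results["result"]["data_array"]]
    return Counter(labels).most_common(1)[0][0]

print(classify_by_neighbors("Hard touchdown on runway 27, go-around initiated"))
```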
Ultimately, although we’ve presented the three approaches separately in this demonstration, a hybrid setup may be the most effective way to classify your text. You can use the supervised NLP model for most classifications due to its speed and cost effectiveness, but when metrics like confidence_level (LLM output), semantic_similarity (LLM prediction vs. input), probability (supervised ML output), and similar_doc_count (vector neighbor consensus) fall below some threshold, you can trigger the LLM or the vector search function to classify those edge cases. You could also define a lower threshold that triggers manual review. Focusing expert time on the less confident predictions lets you continuously retrain more effective models with minimal labeling effort; only the nuanced examples require review. Overall, the ensemble approach can balance speed, accuracy, and cost, and can adapt as your dataset evolves.
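One way to express such a routing rule is sketched below. The table pilot_notes_scored, its columns, and the threshold values are placeholders you would tune for your own data.

```python
# Hybrid routing (sketch): accept confident supervised predictions, escalate
# borderline cases to the LLM or vector search, and send the rest to review.
from pyspark.sql import functions as F

HIGH_CONF = 0.8   # accept the supervised prediction outright
LOW_CONF = 0.5    # below this, don't trust any automated label

scored = spark.table("pilot_notes_scored")   # hypothetical table joining all confidence signals

routed = scored.withColumn(
    "route",
    F.when(F.col("supervised_probability") >= HIGH_CONF, F.lit("accept_supervised"))
     .when(F.col("supervised_probability") >= LOW_CONF, F.lit("escalate_to_llm_or_vector_search"))
     .otherwise(F.lit("manual_review")),
)
routed.groupBy("route").count().show()
```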
In summary, we leveraged an off-the-shelf LLM using AI_QUERY for fast, simple bootstrapping of training data and labels. Next, we trained more traditional NLP models for faster, cheaper predictions based on those labels. Finally, we built a vector search index for a third classification approach. In each method, we added confidence metrics to enable an ensemble approach and build a more effective feedback loop that improves the classification system over time. Ultimately, the flexibility of Databricks unifies approaches that were traditionally siloed across different systems (AI_QUERY in SQL, traditional ML, and semantic search) and makes them accessible to every persona across your business. By combining these approaches, we can build faster, cheaper, better systems that solve business problems without a significant investment in data labeling. For the end-to-end example, try the repo!