Authors: Ellen Hirt, Narjes Majdoub, Giran Moodley
Introduction
Large language models are powerful because they are trained on large and varied datasets. This makes them highly versatile but not necessarily tailored to specific tasks or domains. Various techniques have emerged to extend the capabilities of these models by adapting their outputs to downstream applications. The typical flow starts with testing several prompt engineering methodologies and progresses to more complex and compute-intensive methods such as pre-training, as shown in the figure below. To bridge the complexity and cost gap, a technique called “fine-tuning” becomes essential.
Image 1. Generative AI architecture patterns
Fine-tuning is the process of adapting a pre-trained model to a particular task or dataset by further training it on smaller, task-specific data, allowing it to learn the nuances and requirements of the new context.
Fine-tuning language models is important for several reasons:
- Domain Adaptation: It allows the model to specialize in a specific domain, improving accuracy and relevance when generating content related to that field.
- Efficiency: Fine-tuning is more computationally efficient than training a model from scratch, as it leverages the existing knowledge of a pre-trained model.
- Resource Optimization: By focusing on specific tasks, fine-tuning can reduce the need for extensive data and computational resources. This approach can lead to smaller yet efficient models for the task at hand and shorter prompts, as lengthy instructions with many examples are no longer needed. This, in turn, can lower costs and improve latency, making it accessible for applications with limited resources.
- Performance Enhancement for Specific Tasks: Fine-tuning can significantly enhance the performance of generative models by aligning them more closely with the specific goals and data characteristics of a given application, such as entity extraction or classification. This provides more granular control over the model’s outputs, reduces specific biases, and enforces correctness.
Fine-tuning particularly makes sense in the following scenarios:
- When you have attempted prompt engineering and RAG techniques and the quality was not satisfactory for your use case.
- When cost and latency are important: if a large model is good enough, but you need faster and cheaper inference or a smaller model that fits in the memory of your edge device, it can be an option to fine-tune a smaller model. Moreover, if your prompts start getting very long, shorter prompts may lead to better performance along with lower latency and cost.
- When you need to include domain-specific or proprietary data from your company: maybe you need to generate text in a style that adheres to specific guidelines, or you need to teach your LLM some domain-specific content (for instance in the automotive domain, “Golf” may have a very different meaning than in sports).
- When your model needs to follow specific instructions or perform specific tasks, such as code generation, named entity recognition (NER) or classification.
However, while fine-tuning is a great approach to improving model results, it may not always be appropriate, particularly in the following cases:
- When a sufficiently large out-of-the-box model performs well enough without further customization and meets your latency requirements, fine-tuning may not be necessary. This is often the case for general tasks such as basic text summarization, translation, or generic Q&A. Similarly, if your tasks deviate only marginally from those of the out-of-the-box model, fine-tuning may not be the best approach.
- If you have a limited amount of data and it doesn’t make sense to use synthetic data for your use case, fine-tuning may not be a feasible approach. In cases where your LLM requires access to frequently changing external knowledge in order to provide relevant outputs, RAG can be a better option. It can also be an addition to fine-tuning if needed.
Even though fine-tuning is a useful approach in many different scenarios, it can be complex to implement and scale, especially if you are just getting started. Therefore, in this article we provide some hands-on starting points and best practices to start your journey using Mosaic AI Model Training on Databricks.
The goal of this product is to make your fine-tuning journey easier and more accessible by:
- Providing an instant, secure and managed compute environment that dynamically scales with your workload.
- Simplifying your tech stack/dependencies by providing an easy-to-use API and UI.
- Facilitating serving of your fine-tuned model through integration with the MLflow Model Registry and Databricks Model Serving.
- Selecting optimal default hyperparameters, with only a few exposed for sweeping.
- Providing more control and ownership over the resulting model as it will be registered to Unity Catalog and thus you can control the permissions and access lineage information.
Image 2. Mosaic AI model training product overview
After reading this article, you will:
- Know when fine-tuning might be a good approach and avoid common pitfalls
- Know how to get started with fine-tuning using Mosaic AI Model Training on Databricks
- Understand which fine-tuning approaches for language models you can leverage on Databricks
- Understand how you can prepare and manage your data for fine-tuning on Databricks
Section 1: Preparing the Environment
Using Mosaic AI Model Training on Databricks:
At the time of writing, fine-tuning on Databricks (via the Mosaic AI Model Training product) is in public preview in the following regions. Mosaic AI Model Training allows you to tune or further train a foundation model using either the Databricks API or the UI. After ensuring that your workspace fulfills the requirements, you can get started with preparing your data for fine-tuning.
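If you plan to use the API rather than the UI, you typically start by installing the Foundation Model Training SDK in a notebook. The snippet below is a minimal sketch; the package name and import path reflect the public preview at the time of writing and may change.

```python
# In a Databricks notebook: install the Foundation Model Training SDK
# (package name as of the public preview; check the docs for the current name).
%pip install databricks_genai
dbutils.library.restartPython()

# The fine-tuning API is exposed under databricks.model_training.
from databricks.model_training import foundation_model as fm
```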
Image 3. Starting a fine-tuning experiment from the UI
Preparing your data for fine-tuning
Fine-tuning requires a sufficient amount of high-quality data that is specific to the task or domain you’re targeting. The data should be in a consistent format, aligned with the pretrained model’s input expectations. For instance, for instruction fine-tuning (IFT) on Databricks, your data needs to be in a .jsonl format, with a column named “prompt” for input, and “response” for targets (labeled data). For continued pre-training (CPT), you can read your training data from a .txt file in a Unity Catalog volume. You can find examples of the expected fine-tuning data formats for the different types of fine-tuning here, as well as a data validation notebook. Apart from reading from a volume, you can use data from a Delta table and public Hugging Face datasets directly too.
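As an illustration, an IFT training file is simply one JSON object per line with “prompt” and “response” keys. The sketch below writes a couple of hypothetical examples to a Unity Catalog volume; the path and the examples themselves are placeholders.

```python
import json

# Hypothetical IFT examples: one JSON object per line with "prompt" and "response" keys.
ift_examples = [
    {"prompt": "Classify the sentiment of this review: 'The battery lasts all day.'",
     "response": "positive"},
    {"prompt": "Classify the sentiment of this review: 'The screen cracked after a week.'",
     "response": "negative"},
]

# Placeholder Unity Catalog volume path; replace with your own catalog/schema/volume.
output_path = "/Volumes/main/finetuning/data/train.jsonl"

with open(output_path, "w") as f:
    for example in ift_examples:
        f.write(json.dumps(example) + "\n")
```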
If you lack enough real-world or labeled data for your use case, you can:
- Annotate your own data, i.e. tag specific elements that are relevant for your objective. Depending on the complexity of the task and the resources you have available, this can be achieved manually by human experts, semi-automatically by adding a human-in-the-loop to supervise the automatically generated labels or fully automatically using machine learning techniques. You can also use inference tables on Databricks to collect feedback from your existing LLM applications.
- Generate synthetic data - while there are various ways to do this, you can find an example here
- Use Databricks inference tables, which automatically collect requests and responses from your model serving endpoints in a Delta table. Combine this with human or automated feedback to curate a new dataset that can be used to further improve your model. For example, data from your RAG application can be used to further fine-tune your customized models, as sketched below.
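As a rough sketch of that last idea, the snippet below reads a hypothetical inference table, keeps only requests whose responses received positive feedback, and writes the result out as prompt/response pairs. All table and column names are illustrative; your inference table schema will differ.

```python
from pyspark.sql import functions as F

# Hypothetical table names; adjust to your catalog/schema and inference table schema.
inference_df = spark.table("main.serving.rag_endpoint_payload")
feedback_df = spark.table("main.serving.user_feedback")  # e.g., thumbs up/down per request_id

curated = (
    inference_df.join(feedback_df, on="request_id")
    .filter(F.col("rating") == "positive")  # keep only well-rated responses
    .select(F.col("request").alias("prompt"), F.col("response").alias("response"))
    .dropDuplicates(["prompt"])
)

# Persist as a Delta table that can later be passed as a fine-tuning train_data_path.
curated.write.mode("overwrite").saveAsTable("main.finetuning.curated_ift_data")
```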
Section 2: Step-by-Step Instructions
After you have decided that fine-tuning is the way to go and prepared your environment, you can start fine-tuning on Databricks with the following steps:
- Data preparation
- Model initialisation and training setup
- Full experiment run and checkpointing
- Evaluation
- Deployment & monitoring
Image 4. Fine-tuning end-to-end workflow on Databricks
1. Data preparation
Depending on the type of fine-tuning experiment that you choose, you will need to ensure that your data is in the right format. When your data is stored in a table in Unity Catalog, you can conveniently load it directly for your fine-tuning run (see the example below from the docs). Your data needs to be provided in .jsonl format with “prompt” and “response” columns for Instruction Fine-tuning (IFT) or Chat completion, or as plain text (.txt) for Continued Pre-Training (CPT).
Image 5. When creating your training run, you can specify your paths in Unity Catalog
Apart from the format, the quantity of your data matters: for chat completion and instruction fine-tuning you should have at least several thousand prompt-response pairs, for continued pre-training you need a larger amount (more details here).
As with classical ML, the “garbage in garbage out” principle also applies here: even if you have sufficient data, your data quality matters as well. You may have a considerable volume of data that is skewed, redundant or irrelevant, which will lead to worse fine-tuning results.
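A few quick checks before launching a run can save a wasted experiment. The sketch below assumes your IFT data already lives in a Delta table with prompt/response columns; the table name is a placeholder.

```python
from pyspark.sql import functions as F

df = spark.table("main.finetuning.curated_ift_data")  # placeholder table name

# Quantity: instruction fine-tuning generally wants at least a few thousand pairs.
print("Number of prompt/response pairs:", df.count())

# Basic quality checks: empty fields and exact duplicates.
print("Rows with an empty prompt or response:",
      df.filter((F.length("prompt") == 0) | (F.length("response") == 0)).count())
print("Duplicate prompts:", df.count() - df.dropDuplicates(["prompt"]).count())

# Rough length distribution to spot samples that may exceed the context length.
df.select((F.length("prompt") + F.length("response")).alias("chars")) \
  .summary("min", "50%", "95%", "max").show()
```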
If you are unsure whether you have (good) enough training data, reach out to your Databricks representative to receive advice on possible mitigations.
2. Model initialisation and training setup
Before starting your fine-tuning run, it is important to set your configuration. Some important parameters include:
- Base model: the name of the base model to use for fine-tuning, e.g., Llama, Mixtral, and others. You can also load custom model weights by setting the custom weights path.
- Train data path and Unity Catalog schema for registering the model: this specifies your training data path in Unity Catalog, as well as where the final model will be stored.
- Training duration and learning rate: this sets the number of epochs, as well as the learning rate of your experiment.
Tip: start with a smaller number of epochs to check that everything is working well before starting the full run.
- Task type: choose from chat completion, instruction fine-tuning, or continued pre-training.
- Context length: the maximum length of your input data samples. This is either 8192 tokens or the maximum context length for the provided model. If you have long documents, make sure to properly chunk or truncate your data. If you have short snippets, consider packing them together for efficiency.
Image 6. Configuring parameters when creating a run
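The same configuration can also be submitted through the Python API. The sketch below uses the foundation_model.create call from the SDK installed earlier; the model name, paths and values are placeholders, and the exact argument names may vary with the preview version.

```python
from databricks.model_training import foundation_model as fm

# All names and values below are placeholders for illustration.
run = fm.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",        # base model to fine-tune
    train_data_path="main.finetuning.curated_ift_data",   # UC table or volume path
    register_to="main.finetuning",                        # UC schema for the resulting model
    task_type="INSTRUCTION_FINETUNE",                     # or CHAT_COMPLETION / CONTINUED_PRETRAIN
    training_duration="1ep",                              # start small to validate the setup
    learning_rate="5e-7",
)

print(run.name, run.status)
```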
For more possible parameter settings, please refer to the following table in the documentation and contact your Databricks representative if you require advice on optimal settings. At the time of writing, the Mosaic AI Model Training Sweeps Private Preview has launched, which lets you automatically compare multiple models in parallel using recommended hyperparameters, including evaluation and tear-down of the provisioned deployments.
3. Full experiment run and checkpointing
Once you start your fine-tuning run, Databricks will take care of experiment tracking, GPU allocation and storing the checkpoints to your specified Unity Catalog location. Some metrics that will be tracked within your MLflow experiment include loss, token accuracy, language perplexity and the learning rate (which is adaptive). You can also keep track of the estimated remaining time and the number of tokens processed. During the run, your model’s parameters and events are logged, and checkpoints are stored.
Image 7. Experiment Overview (MLflow run)
Image 8. Metrics, Parameters and Logs in MLFlow Experiment Run
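Because the run logs to MLflow, you can also pull the tracked metrics programmatically, for example to plot the loss curve or compare several runs. A minimal sketch, assuming you know the experiment path created for your fine-tuning run:

```python
import mlflow

# Placeholder experiment path; use the MLflow experiment created for your fine-tuning run.
experiment = mlflow.get_experiment_by_name("/Users/you@example.com/my-finetuning-experiment")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# Metric names depend on the run (e.g., loss, token accuracy, perplexity, learning rate);
# list whatever was actually logged and inspect it across runs.
metric_cols = [c for c in runs.columns if c.startswith("metrics.")]
print(runs[["run_id", "status"] + metric_cols])
```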
4. Evaluation
In order to evaluate how well your fine-tuned model performs, there are several options. First, it is always helpful to perform a vibe check: manually send some samples to the LLM and check whether the results are as expected.
On top of that, Databricks offers the option to evaluate your LLM quantitatively using labeled data, LLM-as-a-judge, and mlflow.evaluate. To do so, you should establish a baseline metric on your base model before fine-tuning so that you can compare it against the fine-tuned models. Which metric to use depends on your use case (e.g., for summarization you will use a different metric than for classification). When you want to compare multiple models, you can compare the different runs within your experiment in MLflow. A sketch of such an evaluation follows the figure below.
Image 9. Model metrics tracked during fine-tuning run
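As a sketch of the quantitative route, mlflow.evaluate can score a small labeled evaluation set; here the predictions are assumed to have been generated beforehand (by the base or the fine-tuned model), and the data, metric choice and optional LLM judge are purely illustrative.

```python
import mlflow
import pandas as pd

# Small, hypothetical evaluation set with precomputed model outputs.
eval_df = pd.DataFrame({
    "inputs": ["Classify the sentiment: 'The battery lasts all day.'",
               "Classify the sentiment: 'The screen cracked after a week.'"],
    "predictions": ["positive", "positive"],   # outputs from the model under evaluation
    "ground_truth": ["positive", "negative"],  # labels
})

with mlflow.start_run(run_name="eval-finetuned-model"):
    results = mlflow.evaluate(
        data=eval_df,
        predictions="predictions",
        targets="ground_truth",
        model_type="question-answering",  # selects sensible default metrics, e.g., exact match
        # Optionally add an LLM judge (requires access to a judge model), e.g.:
        # extra_metrics=[mlflow.metrics.genai.answer_correctness()],
    )
    print(results.metrics)
```

Running the same evaluation against the base model first gives you the baseline to compare against.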
5. Deployment and Monitoring
Finally, you can deploy your fine-tuned model registered in Unity Catalog to Mosaic AI Model Serving. You can then leverage your model for batch inference or real-time serving. Databricks Model Serving allows you to use inference tables, which capture your users’ requests and model responses in a Delta table. These can be used to collect further potential fine-tuning training data. To adapt to your users’ needs while saving costs, the endpoint can scale down to zero when not in use.
On top of that, Mosaic AI Gateway allows you to implement safety, PII and payload checks, as well as rate limiting and permissions.
Image 10. Model deployment
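A deployment sketch using the Databricks Python SDK is shown below; the endpoint name, Unity Catalog model name and version are placeholders, and the exact configuration classes may differ between SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    AutoCaptureConfigInput,
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

# Placeholder names: the fine-tuned model registered in Unity Catalog and its version.
w.serving_endpoints.create(
    name="my-finetuned-model",
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.finetuning.my_finetuned_model",
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,  # scale down to zero when not in use
            )
        ],
        # Capture requests and responses in a Delta table for future fine-tuning data.
        auto_capture_config=AutoCaptureConfigInput(
            catalog_name="main",
            schema_name="serving",
            table_name_prefix="my_finetuned_model",
        ),
    ),
)
```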
Section 3: Troubleshooting and Tips
Fine-tuning large language models (LLMs) can present several challenges. Here are some of the most common ones:
Data Quality and Quantity:
- One of the primary challenges is obtaining sufficient high-quality, task-specific data. The success of fine-tuning often depends on the availability and quality of the training data. In some domains, acquiring enough labeled data for specific tasks can be difficult and time-consuming. Careful curation and augmentation of datasets may be needed to address this challenge.
Overfitting and Generalization:
- Overfitting is a significant concern when fine-tuning LLMs. The model may perform exceptionally well on training data but struggle to generalize to new, unseen data. This can lead to poor performance in real-world applications. Techniques like hyperparameter tuning, regularization, and using an evaluation dataset where applicable are crucial for identifying and preventing overfitting and promoting better generalization.
Computational Resources:
- Fine-tuning large models can be computationally expensive and resource-intensive, which can be challenging for practitioners with limited computational power. Techniques like model distillation and leveraging more efficient architectures (e.g., horizontal scaling on Databricks) can help address resource constraints.
Catastrophic Forgetting:
- The full fine-tuning process may lead to catastrophic forgetting, which occurs when the model loses its general language understanding capabilities while adapting to a specific task. Balancing the retention of general knowledge with task-specific adaptation is a key challenge.
Bias and Fairness:
- LLMs can inherit or amplify biases present in their training data. Fine-tuning on biased datasets can exacerbate this issue. Ensuring that the fine-tuning process reduces rather than reinforces biases is a significant challenge that requires careful consideration of data selection and model evaluation.
Evaluation and Validation:
- Properly evaluating the performance of fine-tuned models can be challenging. It's crucial to use appropriate metrics and test sets that accurately reflect the model's intended use case. Continuous monitoring and validation are necessary to ensure the model maintains its performance over time.
Alignment with Human Values:
- Ensuring that fine-tuned LLMs generate outputs aligned with human values and ethical considerations is a significant challenge. This involves careful prompt engineering and potentially using techniques like reinforcement learning from human feedback. Additionally, implementing guardrails in the input data for fine-tuning and on the evaluation output of the model can help reduce risks and assess the model’s alignment with desired ethical standards. These guardrails can include pre-defined ethical guidelines, bias detection mechanisms, and continuous monitoring systems. Furthermore, regular audits and updates based on real-world feedback are crucial to maintaining the model’s alignment with evolving human values and societal norms.
By understanding and addressing these challenges, practitioners can more effectively fine-tune LLMs for specific tasks and domains, leading to improved performance and more reliable AI applications.
Furthermore, we have collected some of the questions that we frequently encounter:
- Do I need to decide between RAG and fine-tuning?
- No. RAFT (Retrieval Augmented Fine-Tuning) can be a very good complementary approach when you want to do RAG using a smaller, customized model.
- How large should the base model I use for fine-tuning be?
- This depends on your use case. You should also consider the trade-off between training duration and the size of the base model. One of the benefits of fine-tuning is that a smaller model can achieve similar performance once adapted to your specific use case. You can refer to our LLM fine-tuning demo to get an idea of commonly used base models.
- How long should I train the model for?
- Again, this depends on your use case. Using MLflow Tracking, you can monitor your progress during fine-tuning and see whether metrics such as loss and token accuracy are improving.
- How much data is needed for fine-tuning?
- The amount of data needed for fine-tuning depends on the complexity of the task, the fine-tuning method, and the desired accuracy. Generally, a few thousand high-quality examples are sufficient, but more data can improve performance and robustness. The exact amount can vary based on the specific use case and model.
- How often do I need to perform fine-tuning?
- How accurate your model stays will depend largely on your use case, your data and how the model is being used (e.g., can you expect user queries to change over time?). With Databricks Lakehouse Monitoring and AI/BI Dashboards, you can keep track of your model version, underlying data, queries and responses, and set alerts when the accuracy of responses starts to degrade.
Conclusion
Thank you very much for taking the time to read our blog; we hope you now have a clearer understanding of how to get started with fine-tuning on Databricks, when fine-tuning might be a good approach, and how to avoid common pitfalls.
In addition, we also described how Databricks Mosaic AI Model Training allows an end-to-end, high-quality fine-tuning workflow with easy access to training data, fine-tuning runs, models, evaluation, and more, enabling users to leverage a fully-governed platform with everything they need.
If you are interested in learning more about fine-tuning use cases, you can read how enterprises are using Mosaic AI Model Training.
For anything beyond the scope of this blog, we recommend contacting your Databricks representatives who can arrange a deeper-dive where needed.