Databricks Community

StuartLynn · ‎09-16-2024

In the first part of this article, we explored the critical role that version control plays in data science, particularly when using Git within the Databricks environment. We covered the foundational concepts of Git, its integration with popular version control platforms, and how to set up and manage repositories in Databricks Repos.

With these basics in mind, the next step is to delve into best practices for using Git effectively in data science projects. In this second part, we will focus on branching strategies, structuring your projects, and choosing a mono- vs multi-repository structure. Whether you're working solo or as part of a multidisciplinary team, adhering to these best practices will help you leverage Git's full potential to maintain clean, efficient, and robust workflows.

The ‘feature branch’ workflow

Now we understand the basics of managing changes to code inside Databricks, it’s time to take a look at how to make this work when several developers are collaborating on the same project.

Ask three software developers which Git workflow is ‘the best’ and you’ll receive four opinions. Your organisation might have a standard they want you to follow or you may have a member of your team who has tried and found success with a particular approach. If that’s not the case and you’re looking for a simple option that we see works well with our customers, you could do worse than adopting the ‘feature branch’ workflow.

In this scenario, each discrete fix, enhancement or additional feature added to the code is developed in its own branch. To translate to the data science world, these discrete changes might equate to e.g. adding new covariates into a model’s training data, employing a different performance metric for model evaluation or changing the code used for model inference to obfuscate details of requests during logging. The golden rule is to never allow developers to commit to the ‘main’ or ‘master’ branch of your repository.

By using separate branches for all changes, it becomes easy for multiple developers to work on a particular feature without disturbing the main codebase and implies the main branch should never contain broken code.
When it comes to applying the changes in a feature branch to ‘main’, this is achieved by raising a ‘pull request’ (more commonly, ‘PR’), having it reviewed by peers within your team and approved before finally ‘merging’ the updates into the ‘main’ branch.

NOTE A notable concern associated with feature branch workflow is having long-lived feature branches. If feature branches are not merged back into the main branch frequently, they can become outdated and more difficult to integrate. This can result in intricate merge conflicts when attempting to merge the feature branches back into the main branch. To mitigate this risk, it is crucial to establish clear guidelines for branching, such as implementing time constraints to ensure timely integration and minimise complexities in the merging process.

Let’s take a look at how we might build on the previous example in order to adapt it to fit this workflow. What you see described in this section is a very basic version of the suggested flow from our documentation (AWS | Azure | GCP) :

Create a new branch

When we clone a new repository, by default we see and can interact with the contents of the ‘main’ or ‘master’ branch.

Clicking on the branch name (or ‘Git…’ under the kebab menu) takes us to the Git UI where we have the option of creating a new branch:

Naming a branch in a sufficiently concise but descriptive fashion is often tricky, but you could do worse than following the advice here.
In the example above, we are creating a branch based on the main branch. There’s no reason why you couldn’t branch from another feature branch to include all of the changes made there as your starting point.

NOTE You can always switch between the different branches in your repository by coming back to the Git UI and choosing a different branch in the drop-down selector.

Make your changes

Make the necessary changes, commit and push to the new branch.

Raise a PR within your version control platform

Find the option to create a pull request from your new branch, e.g. for Azure DevOps it should appear in the ‘Files’ page for your repository:

Unless you are branching from a branch, the PR will most likely be raised against the ‘main’ branch (do check this before creating the PR!) It is good practice to provide as much detail as your selected reviewer(s) will need in order to figure out what it is you’re trying to achieve with this update.

Create the PR and ask your colleagues nicely if they can review it as soon as they’re able to.

Reviewing a PR and merging

If you are a reviewer on a colleague’s PR, the most helpful feature of the version control platform will be the diff viewer:

You can ask questions and leave your colleague(s) feedback on specific parts of the code.

In response to your comments, your colleague(s) can push further changes to the branch and this will update the PR view.
Once your review is complete and you’re happy with the changes, you can approve the changes, merge them into the target branch (‘main’ in this example) and close the PR.

(post-approval, pre-merge)

(merge confirmation, we’ll discuss merge types shortly)

Your main branch should now include the changes and its history will include all of the commits that you made in the feature branch:

Pulling in changes from other branches and resolving merge conflicts

In the previous example, we created a feature branch from main, made some small changes and quickly merged it back in. There were no changes to main while our feature branch existed and so there was no likelihood of us having so-called ‘merge conflicts’ blocking our ability to merge our PR.

If, however, another of our colleagues had also created a feature branch on main and has been working on it while you were busy merging your changes back into main then, should you both have changed the same file(s), he or she will need to resolve any merge conflicts before their changes can be merged into main.

Just as before, we commit and push the new branch’s changes from Databricks:
And raise a PR to merge it into main.
Now we are told there is a merge conflict in our code and this will prevent us being able to merge it into main.

How do we resolve this? We need to merge in the ‘new’ changes that have been committed to main, ensure our new code works with both sets of changes, then repeat the commit-and-push process. We can achieve that first step (pulling in the changes from main) by selecting ‘Merge’ from the kebab menu in the workspace Git UI:

As a side note, your choice of ‘merge’ vs ‘rebase’ really depends on how you want the commit history to look after you’ve finished your work. There’s a decent explanation here if you’re curious to know more.

Select the main branch to bring into your feature branch and hit ‘Begin merge’.

Once the merge operation has completed, a message will appear in the Git UI that explains what your options are for resolving the conflicts before you commit and push.

Files with conflicts that need to be resolved will be annotated with an exclamation mark in the ‘Changes’ list, while in cases where the ‘other’ history has added a new file, you’ll see an ‘A’ next to the file instead:

Usually, you’ll want to manually reconcile the changes to a conflicting file using the editor, e.g:

When you are happy, hit ‘mark as resolved’ and repeat the process until you have no more conflict annotations in the ‘Changes’ list.
You should now be able to click on ‘Continue Merge’ to complete the process and push the new code to the remote repository.

If you refresh your PR in the version control platform, it should now allow you to merge into the target branch.

Your main branch now includes the changes from both of the feature branches and if you look at the commit history, you’ll see an extra ‘merge’ node that represents the action we just performed (the second from the top in this example):

Structuring your projects

A question that we are frequently asked by Databricks customers is: how should we structure our repositories to enable streamlining and standardisation of our projects while retaining sufficient flexibility to cater for different flavours of project and assets?

If you are coming to this problem from the laptop-based data science workflows, you might be familiar with project templates such as Cookiecutter Data Science (docs). These work very well in the context of local development. When trying to use these with Databricks, you may encounter some inconsistencies in the principles underlying these templates and the characteristics of the lakehouse, for example:

These kinds of templates often propose keeping model training and evaluation data together with the code in the same repository. In the lakehouse, Unity Catalog is the home for all data: raw, transformed or reference. And this is for good reason, since datasets should have the appropriate governance and controls applied to them at all times and, in any case, are unlikely to be small enough to be mirrored into a VCS repository.
They often suggest storing serialised trained model objects with the project source code. Again, this runs contrary to the approach we recommend in Databricks where models should be serialised by, and accessible through, the MLflow tracking server and Unity Catalog. This allows us to manage different versions of the same model and map models back to the code, hyperparameters, data and environment with which they were trained.

Instead, our recommendation is to use the MLOps Stacks (AWS | Azure | GCP) templating system available through the Databricks CLI. This set of templates goes beyond code for training models and running inference and contains Databricks Workflow definitions that map neatly to our recommended MLOps workflows and allow some degree of customisation over the compute configuration used to execute these tasks. The example code in the template also shows how to store and update features in the Databricks feature store.

Installation is guided, through our CLI, so you can configure the project to suit your needs and environment.

Once deployed, the repo contains instructions on how to configure all of the Databricks components, how to make everything work with your local IDE (if that’s how you want to do your development), explanations of the example code provided and instructions on how to set up CI/CD in your version control platform. Again, we’ll return to this topic in more detail in a later article.

Choosing a mono- vs. multi-repo structure

Our customers sometimes ask whether it is better to keep all of the team’s code in a single repository or host a separate repository for each project.

In general our advice would be that, for small teams that don’t need:

isolation between different projects (or your organisation demands it); or
different workflows for automation and deployment of your application

you may find the monorepo approach beneficial as it allows sharing of common components (such as ‘utility’ code functions) across multiple projects.

You can always use the ‘sparse checkout’ functionality of our Repos feature to just mirror the code for your particular project from a monorepo inside the workspace ( AWS | Azure | GCP ).

Summary

This article builds on the foundational concepts of version control in data science, particularly within the Databricks environment, discussed in part one. It delves into Git best practices, focusing on the "feature branch" workflow, project structuring, and the decision between mono- and multi-repository setups. By following these guidelines, data science teams can ensure efficient collaboration, maintain clean codebases, and streamline workflows, ultimately enhancing the robustness and scalability of their projects.

Coming up next!

Next blog in this series: MLOps-Gym - Databricks Feature Store - Part Two

Databricks Community

MLOps Gym - Version control best practices

The ‘feature branch’ workflow

Create a new branch

Make your changes

Raise a PR within your version control platform

Reviewing a PR and merging

Pulling in changes from other branches and resolving merge conflicts

Structuring your projects

Choosing a mono- vs. multi-repo structure

Summary

Coming up next!

Metadata-Driven ETL Framework in Databricks (Part-1)

Top 10 query performance tuning tips for Databricks Serverless SQL

Best practices for safe data experimentation with Databricks