In the first part of this article, we explored the critical role that version control plays in data science, particularly when using Git within the Databricks environment. We covered the foundational concepts of Git, its integration with popular version control platforms, and how to set up and manage repositories in Databricks Repos.
With these basics in mind, the next step is to delve into best practices for using Git effectively in data science projects. In this second part, we will focus on branching strategies, structuring your projects, and choosing a mono- vs multi-repository structure. Whether you're working solo or as part of a multidisciplinary team, adhering to these best practices will help you leverage Git's full potential to maintain clean, efficient, and robust workflows.
Now that we understand the basics of managing changes to code inside Databricks, it’s time to look at how to make this work when several developers are collaborating on the same project.
Ask three software developers which Git workflow is ‘the best’ and you’ll receive four opinions. Your organisation might have a standard they expect you to follow, or a member of your team may already have found success with a particular approach. If neither is the case and you’re looking for a simple option that we see working well with our customers, you could do worse than adopting the ‘feature branch’ workflow.
In this scenario, each discrete fix, enhancement or new feature is developed in its own branch. Translated to the data science world, these discrete changes might be adding new covariates to a model’s training data, using a different performance metric for model evaluation, or changing the model inference code to obfuscate details of requests during logging. The golden rule is to never allow developers to commit directly to the ‘main’ (or ‘master’) branch of your repository.
NOTE A notable concern with the feature branch workflow is long-lived feature branches. If feature branches are not merged back into the main branch frequently, they become outdated and harder to integrate, which can lead to intricate merge conflicts when you finally attempt to merge them. To mitigate this risk, establish clear branching guidelines, such as time limits on how long a feature branch may live, to ensure timely integration and keep merges simple.
Let’s take a look at how we might build on the previous example and adapt it to fit this workflow. What follows is a very basic version of the suggested flow from our documentation (AWS | Azure | GCP):
NOTE You can always switch between the different branches in your repository by coming back to the Git UI and choosing a different branch in the drop-down selector.
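If you also work with a local clone of the same repository outside Databricks, the command-line equivalent of that drop-down is simply checking out the branch by name; a minimal sketch, where `my-feature-branch` is a placeholder branch name:

```bash
# List the branches available locally and on the remote
git branch --all

# Switch your working copy to the feature branch
git switch my-feature-branch   # or: git checkout my-feature-branch
```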
Make the necessary changes, commit and push to the new branch.
(Screenshot: the pull request after approval, before merging)
(Screenshot: the merge confirmation dialog; we’ll discuss merge types shortly)
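For reference, here is what the same feature-branch loop looks like from the command line, for example in a local clone of the repository; a minimal sketch, assuming a remote named `origin`, a default branch called `main`, and a hypothetical file `train.py` being changed (the branch name is illustrative only):

```bash
# Start from an up-to-date copy of main
git checkout main
git pull origin main

# Create and switch to a new feature branch
git checkout -b feature/add-new-covariates

# Make your changes, then stage and commit them
git add train.py
git commit -m "Add new covariates to training data"

# Push the branch to the remote
git push --set-upstream origin feature/add-new-covariates
```

The pull request itself is then opened and reviewed in your Git provider (GitHub, Azure DevOps, GitLab and so on), not from the command line.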
In the previous example, we created a feature branch from main, made some small changes and quickly merged it back in. There were no changes to main while our feature branch existed, so there was no chance of so-called ‘merge conflicts’ blocking us from merging our PR.
If, however, a colleague had also created a feature branch from main and was working on it while you were merging your changes back into main, and you both changed the same file(s), they will need to resolve any merge conflicts before their changes can be merged into main.
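A common way to resolve this is for the colleague to pull the latest main into their feature branch, fix the conflicting sections by hand, and push again; a rough sketch, assuming the conflicting file is a hypothetical `train.py` and the branch name is a placeholder:

```bash
# On the colleague's feature branch, bring in the latest main
git checkout feature/other-change
git fetch origin
git merge origin/main
# Git reports a conflict in train.py and marks the clashing sections
# with <<<<<<<, ======= and >>>>>>> markers

# Edit train.py to keep the intended version of each conflicting section,
# then mark the conflict as resolved and complete the merge
git add train.py
git commit
git push
```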
As a side note, your choice of ‘merge’ vs ‘rebase’ really depends on how you want the commit history to look after you’ve finished your work. There’s a decent explanation here if you’re curious to know more.
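As a rough illustration (separate from the explanation linked above, and with placeholder branch names), the two approaches to incorporating the latest main into a feature branch look like this:

```bash
# Option 1: merge - preserves the history of both branches exactly as it
# happened and records an extra merge commit on the feature branch
git checkout feature/other-change
git merge origin/main

# Option 2: rebase - replays the feature branch's commits on top of the
# latest main, giving a linear history but rewriting those commits
git checkout feature/other-change
git rebase origin/main
# After a rebase, pushing the branch typically requires a force push
git push --force-with-lease
```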
A question we are frequently asked by Databricks customers is: how should we structure our repositories to streamline and standardise our projects, while retaining enough flexibility to cater for different flavours of project and asset?
If you are coming to this problem from laptop-based data science workflows, you might be familiar with project templates such as Cookiecutter Data Science (docs). These work very well in the context of local development; when trying to use them with Databricks, however, you may encounter some inconsistencies between the principles underlying these templates and the characteristics of the lakehouse.
Instead, our recommendation is to use the MLOps Stacks (AWS | Azure | GCP) templating system available through the Databricks CLI. These templates go beyond code for training models and running inference: they also contain Databricks Workflow definitions that map neatly onto our recommended MLOps workflows and allow some degree of customisation of the compute configuration used to execute these tasks. The example code in the template also shows how to store and update features in the Databricks feature store.
Installation is guided through our CLI, so you can configure the project to suit your needs and environment.
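As a rough sketch (assuming a recent version of the Databricks CLI with support for Databricks Asset Bundles; the exact prompts vary between releases), initialising a project from the template looks something like this:

```bash
# Initialise a new project from the MLOps Stacks template;
# the CLI then walks you through a series of configuration prompts
# (project name, cloud, workspace URLs and so on)
databricks bundle init mlops-stacks

# Inspect the generated project
cd <your-project-name>   # placeholder for the name you chose at the prompts
ls
```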
Once deployed, the repo contains instructions on how to configure all of the Databricks components, how to make everything work with your local IDE (if that’s how you want to do your development), explanations of the example code provided and instructions on how to set up CI/CD in your version control platform. Again, we’ll return to this topic in more detail in a later article.
Our customers sometimes ask whether it is better to keep all of the team’s code in a single repository or host a separate repository for each project.
In general, our advice is that small teams without more complex requirements may find the monorepo approach beneficial, as it allows common components (such as shared ‘utility’ code functions) to be reused across multiple projects.
You can always use the ‘sparse checkout’ functionality of our Repos feature to mirror just the code for your particular project from a monorepo inside the workspace (AWS | Azure | GCP).
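As a hedged sketch of how that might look via the Repos REST API (the same settings can also be configured through the Repos UI; the workspace URL, token, repo URL and paths below are all placeholders), creating a repo that checks out only one project’s folder from a monorepo could look something like this:

```bash
# Create a repo in the workspace that only checks out the
# "projects/churn_model" folder of the monorepo (values are illustrative)
curl -X POST "https://<your-workspace-url>/api/2.0/repos" \
  -H "Authorization: Bearer <your-personal-access-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://github.com/your-org/your-monorepo.git",
        "provider": "gitHub",
        "path": "/Repos/your.name@example.com/churn_model",
        "sparse_checkout": {
          "patterns": ["projects/churn_model"]
        }
      }'
```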
This article builds on the foundational concepts of version control in data science, particularly within the Databricks environment, discussed in part one. It delves into Git best practices, focusing on the "feature branch" workflow, project structuring, and the decision between mono- and multi-repository setups. By following these guidelines, data science teams can ensure efficient collaboration, maintain clean codebases, and streamline workflows, ultimately enhancing the robustness and scalability of their projects.