In the halcyon days of data science’s youth, version control was an oft-overlooked aspect of the work of data science teams, a “nice-to-have” that was perhaps the domain of hobbyists and enthusiasts. Those coming to data science from a software development background would curse the lack of attention to the subject from their peers, and rightly so. How can you work efficiently in a multidisciplinary team if you can’t divide work up between individuals and work in parallel on the source code of your projects and applications? How do you ensure reproducibility of results without having an audit trail of a project’s code, the data, the configuration and the environment in which the code is executed?
Today, it is easier than ever to adopt the best practices of software development without placing an unreasonable burden on data scientists, and this is especially true for users of the Databricks Lakehouse.
If you intend to do anything more than one-off analysis tasks, and harbour ambitions of building data-centric applications with embedded machine learning components, then you will need to build this capability within your team. This article will show you how.
After a brief battle for prominence in the 2000s, the distributed version control model of Git won out over contemporaries like Subversion and Mercurial as the primary tool for managing the source code of large projects, especially large, complex, open-source projects. Today, it is the de facto version control system of choice and all of the version control capabilities within Databricks sit atop Git.
Git’s distributed design allows contributors to work on their code locally and provides flexibility as to when they ‘push’ their changes to a remote ‘repository’ (and this is, in fact, exactly how the process works within Databricks, as we shall see).
Let’s cover some of the nomenclature that is often associated with working with Git:

- Repository (or ‘repo’): the collection of files under version control, together with their full change history.
- Commit: a recorded snapshot of a set of changes, accompanied by a message describing them.
- Branch: an independent line of development within a repository.
- Remote: a copy of the repository hosted elsewhere (typically on a version control platform) that collaborators share.
- Clone: a local copy of a remote repository.
- Push / pull: sending your local commits up to the remote, or fetching the remote’s latest commits into your local copy.
- Merge: combining the changes from one branch into another.
We’ll cover what branches represent and branching strategy later in part 2 of this article but, at a very simplistic level, a branch is an isolated line of development: changes can be made on a branch without affecting the rest of the codebase until you deliberately combine them.
Git provides a number of tools and techniques for combining branches, which we’ll explore in more detail in this guide.
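To make that concrete, here is a minimal sketch of the branch-and-merge cycle using the Git command line. The repository, branch, and file names are illustrative, and it assumes Git 2.28 or later:

```shell
# Sketch of the branch-and-merge cycle; repository, branch and file
# names are illustrative.
git init --initial-branch=main demo
git -C demo config user.email "you@example.com"   # local identity so commits work anywhere
git -C demo config user.name "Demo User"

# First commit on 'main'
echo "print('baseline model')" > demo/train.py
git -C demo add train.py
git -C demo commit -m "Initial commit on main"

# Develop a change on a feature branch, isolated from 'main'
git -C demo switch -c feature/tune-model
echo "print('tuned model')" > demo/train.py
git -C demo commit -am "Tune the model"

# Combine the finished branch back into 'main' with a merge
git -C demo switch main
git -C demo merge feature/tune-model
```

After the final merge, ‘main’ contains the commit made on the feature branch; this isolate-then-combine cycle is the essence of working with branches.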
A quick note on nomenclature: while ‘main’ and ‘master’ are used interchangeably in the context of Git branches, it is generally accepted that ‘master’ has an undesirable subtext and that ‘main’ should be preferred where possible. Most version control platforms have an option to set the default name; in Azure DevOps, for example, it can be found in the ‘Project settings’ / ‘All repositories settings’ menu.
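If you have an existing repository whose default branch is still ‘master’, the rename itself is a one-liner, and you can set the default for all future repositories on your machine globally. A sketch with illustrative names:

```shell
# Sketch: renaming the default branch of an existing repository.
# Repository name and identity values are illustrative.
git init --initial-branch=master legacy-repo
git -C legacy-repo config user.email "you@example.com"
git -C legacy-repo config user.name "Demo User"
echo "# my project" > legacy-repo/README.md
git -C legacy-repo add README.md
git -C legacy-repo commit -m "Initial commit"

# Rename the local branch from 'master' to 'main'
git -C legacy-repo branch -m master main

# Make 'main' the default name for all future 'git init' runs on this machine
git config --global init.defaultBranch main
```

On a repository with a real remote you would follow the local rename with `git push -u origin main`, switch the default branch in the platform’s settings as described above, and then delete the old ‘master’ branch on the remote.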
Most people in the data science world will have heard of GitHub, even if they’re not entirely sure what purpose it fulfils (perhaps beyond making useful tools available for download and use in your own projects) or how Git and GitHub differ. Simplistically: where Git is a version control system, GitHub is better described as a version control platform. It is a place where teams can host their remote repositories, and it provides the APIs needed for using Git to push your local changes up to the remote.
It is worth adding, however, that the capabilities of platforms such as GitHub go far beyond simple hosting of repositories: for example, enabling efficient code review by producing summarised comparison views between different versions of the codebase (known colloquially as ‘diffs’), and providing the ability to run automation scripts to build the application and execute tests whenever a user commits and pushes their changes.
And of course, GitHub is not the only such version control platform; there are many like it. Some are cloud-based, providing the maximum possible degree of collaboration and accessibility, including cross-organisation or even public collaboration. Some organisations, on the other hand, prefer their source code to be managed within a much tighter trust zone and so will self-host their own version control platform.
Databricks provides support for all the following platforms:

- GitHub
- GitHub Enterprise Server*
- GitLab
- GitLab self-managed*
- Bitbucket Cloud
- Bitbucket Server*
- Azure DevOps (Azure Repos)
- AWS CodeCommit
Those in the list suffixed with an asterisk are typically deployed on-premises.
In 2021, Databricks made the ‘Repos’ feature generally available to all users of the platform. Longer-tenured Databricks users will perhaps recall what a great leap forward this was from what was possible before (a very clunky workflow for syncing individual notebooks with a Git repository).
The Repos feature has subsequently been enhanced and renamed ‘Git Folders’, with the biggest material change being the ability to store a Repo / Git Folder anywhere within the Databricks Workspace (rather than just at the /Repos folder). You may hear both terms used interchangeably by other Databricks users.
With Databricks Git Folders, Databricks does not itself become a version control platform; rather, this functionality positions the workspace as a home for ‘local’ versions of your repositories, with the remotes remaining in your chosen version control platform.
Users clone a remote repository into their workspace, add notebooks, scripts, and other files, or make changes to the version of the code stored there, before committing these changes and pushing them back to the remote.
Because of the way the Git Folders feature has been implemented in Databricks, it is also possible to clone the remote repository to your local machine and work on it there using one of the patterns described in our docs section ‘Use IDEs with Databricks’ (AWS | Azure | GCP). If changes to the code then need to be synchronised back into the workspace, push your local changes to the remote repository, then refresh the version in the workspace by ‘pulling’ in the latest version of the code.
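This push-then-pull synchronisation pattern can be sketched entirely with the Git command line. In the sketch below, a local bare repository stands in for the remote on Azure DevOps, a ‘laptop-copy’ clone plays the role of your local machine, and a ‘workspace-copy’ clone plays the role of the Databricks workspace (all names are illustrative):

```shell
# A local bare repository stands in for the remote on Azure DevOps.
git init --bare --initial-branch=main remote-repo.git

# 'laptop-copy' plays the role of a clone on your local machine.
git clone remote-repo.git laptop-copy
git -C laptop-copy config user.email "you@example.com"
git -C laptop-copy config user.name "Demo User"

# Work locally, commit, and push to the remote as usual.
echo "print('v1')" > laptop-copy/etl.py
git -C laptop-copy add etl.py
git -C laptop-copy commit -m "Add ETL script"
git -C laptop-copy push origin HEAD:main

# 'workspace-copy' plays the role of the Git Folder in the Databricks workspace.
git clone remote-repo.git workspace-copy

# A further change pushed from the laptop...
echo "print('v2')" > laptop-copy/etl.py
git -C laptop-copy commit -am "Update ETL script"
git -C laptop-copy push origin HEAD:main

# ...is picked up in the 'workspace' copy by pulling from the shared remote.
git -C workspace-copy pull
```

Note that the two clones never talk to each other directly: all synchronisation flows through the shared remote, which is exactly how your laptop and a Databricks Git Folder stay in step.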
If that’s of interest to you, be sure to check out our guide on setting up your IDE also in the MLOps Gym.
Here we shall present a workflow covering the very simplest scenario: a single user starting a new project where they intend to work on their own and who wishes to persist their codebase in a remote repository in Azure DevOps.
The first step is to link your Azure DevOps and Databricks accounts using the ‘Linked accounts’ UI, found under your user settings inside the Databricks workspace.
1. Log into your Azure DevOps organisation by selecting the appropriate organisation from the list that’s shown here: https://aex.dev.azure.com/me
2. Create a new project.
3. From the new repo’s main page, find the URL for the repo under ‘Repos’ / ‘Files’ and copy it to the clipboard.
1. Select the option to create a new Git folder in your workspace (either via the workspace ‘Create’ menu or via the right-click context menu).
2. Paste the URL into the form box marked ‘Git repository URL’ and hit ‘Create Repo’. You should now see the repo in your workspace.
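As an aside, this repo-creation step can also be scripted rather than clicked through. The sketch below is an illustrative call to the Databricks Repos REST API; the workspace host, token, repository URL, and workspace path are all placeholder values you would substitute with your own:

```shell
# Placeholder values throughout - substitute your own workspace host, token,
# Azure DevOps repo URL and target workspace path.
curl -X POST "https://<your-workspace-host>/api/2.0/repos" \
  -H "Authorization: Bearer <your-personal-access-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://dev.azure.com/<org>/<project>/_git/<repo>",
        "provider": "azureDevOpsServices",
        "path": "/Repos/<user>/<repo>"
      }'
```

This is handy when you need to provision the same repo into many users’ folders, or as part of a CI/CD pipeline.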
1. If you enter the Git Folder / Repo, you should now be able to create objects inside the repository.
2. Create a notebook and write some simple code inside.
1. From the Git Folder / Repo kebab menu, select ‘Git…’ to access Git operations.
2. This will show a summary of the changes made (referred to as a ‘diff’), allow you to select the files to be included in the commit, and let you add a message describing the commit.
3. Clicking ‘Commit & Push’ now will trigger the changes to be propagated to the ‘master’ branch of the remote repository in Azure DevOps.
4. The Repos view in Azure DevOps should now reflect the commit you just pushed.
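For reference, the ‘Commit & Push’ button is doing roughly the equivalent of the following Git commands behind the scenes. The sketch uses a local bare repository standing in for the Azure DevOps remote, and the file and repository names are illustrative:

```shell
# A local bare repository stands in for the remote repo in Azure DevOps
# (which defaults to a 'master' branch, as noted above).
git init --bare --initial-branch=master origin-repo.git

git clone origin-repo.git workdir
git -C workdir config user.email "you@example.com"
git -C workdir config user.name "Demo User"

# Stage the changed file, record a commit with a message, then push it to
# the remote - the three things 'Commit & Push' bundles into one click.
echo "print('hello, lakehouse')" > workdir/my_notebook.py
git -C workdir add my_notebook.py
git -C workdir commit -m "Add my first notebook"
git -C workdir push origin HEAD:master
```

Understanding this mapping makes it much easier to reason about what the Databricks Git dialog is doing, and to recover if an operation fails partway through.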
A note on the frequency of these actions: commits work best when they are small and made often, capturing one logical change at a time; pushes to the remote can be less frequent, but pushing regularly ensures your work is backed up and visible to collaborators.
The steps above will vary slightly depending on your organisation’s choice of version control platform, especially the parts concerning linking of accounts. All the options are documented in our public docs (AWS | Azure | GCP).
The article discusses the evolution of version control in data science, emphasizing its importance for efficient teamwork, reproducibility, and maintaining an audit trail of project code, data, configuration, and execution environments. Initially overlooked in data science, version control, particularly using Git, is now essential for teams aiming to build robust data-centric applications. The article explains Git's role as the primary version control system and describes how it integrates with platforms like GitHub and Azure DevOps within Databricks. It also provides a practical guide for setting up and using Databricks Repos for version control, highlighting the steps for linking accounts, creating repositories, and managing code changes.
In part 2 of this article, we will discuss best practices when setting up version control.
Next blog in this series: MLOps Gym - Version Control - Part 2: Best Practices