Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
StuartLynn
New Contributor III

In the halcyon days of data science’s youth, version control was an oft-overlooked aspect of the work of data science teams, a “nice-to-have” that was perhaps the domain of hobbyists and enthusiasts. Those coming to data science from a software development background would curse the lack of attention to the subject from their peers, and rightly so. How can you work efficiently in a multidisciplinary team if you can’t divide work up between individuals and work in parallel on the source code of your projects and applications? How do you ensure reproducibility of results without having an audit trail of a project’s code, the data, the configuration and the environment in which the code is executed?

Motivation

Today, it is easier than ever to adopt the best practices of software development without introducing an unreasonable burden on data scientists, and this is especially true for users of the Databricks Lakehouse.

If you intend to do anything more than once-and-done analysis tasks, and harbour ambitions of building data-centric applications with embedded machine learning components, then you will need to build this capability within your team, and this article will show you how.

Git

After a brief battle for prominence in the 2000s, the distributed version control model of Git won out over contemporaries like Subversion and Mercurial as the primary tool for managing the source code of large projects, especially large, complex, open-source projects. Today, it is the de facto version control system of choice and all of the version control capabilities within Databricks sit atop Git.

Git’s distributed design allows contributors to work on their code locally and provides flexibility as to when they ‘push’ their changes to a remote “repository” (and this is, in fact, exactly how the process works within Databricks as we shall see).

Let’s cover some of the nomenclature that is often associated with working with Git:

  • A ‘repository’ is a single container for all of the code and other assets required to build and run an application. Superficially, a repository looks just like a file system, but under the covers it also contains the history of all changes made to the project.
  • The central golden source of truth for a codebase is the ‘remote’ repository, usually hosted within a version control platform such as Github. All changes made by individual developers are sent to this ‘remote’.
  • Developers will initially ‘clone’ the remote repository to some local development environment where they can then work on the code, making changes where necessary.
  • At some point, developers can ‘commit’ their changes to Git’s history. As a rule, commits are accompanied by an informative message summarising the changes made by the committer.
  • Commits are then ‘pushed’ to the remote repository at more-or-less frequent intervals where they now become visible to all other viewers of the remote.
  • Once a repository has been initially cloned to a local environment, it can be brought up to date by ‘fetching’ the change log from the remote. This action does not apply the latest changes to the version of the code the developer sees; to do that, they must ‘pull’ a remote branch onto their local working branch. Which leads us to…
  • The repository may contain many different versions / variations of the project’s source code, arranged in ‘branches’ which are the product of different permutations of commits.
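The whole vocabulary above can be sketched end-to-end on the command line. This is an illustrative walkthrough only: a local bare repository at `/tmp/demo-remote.git` stands in for a hosted remote (GitHub, Azure DevOps, etc.), and all paths, file names and identities are made up.

```shell
# A local bare repository stands in for the hosted 'remote';
# in practice this would live on GitHub, Azure DevOps, etc.
rm -rf /tmp/demo-remote.git /tmp/demo-local
git init --bare /tmp/demo-remote.git

# 'Clone' the remote into a local working copy
git clone /tmp/demo-remote.git /tmp/demo-local
cd /tmp/demo-local
git config user.email "dev@example.com"  # identity for the demo commits
git config user.name "Demo Dev"

# Make a change and 'commit' it with an informative message
echo "print('hello')" > train.py
git add train.py
git commit -m "Add initial training script"
git branch -M main  # normalise the branch name to 'main'

# 'Push' the commit so it becomes visible to all viewers of the remote
git push origin main

# 'Fetch' downloads the remote's latest history without touching your
# working copy; 'pull' fetches and applies it in one step
git fetch origin
git pull origin main
```

Within a Databricks Git Folder, the commit, push and pull steps are performed through the UI rather than the command line, but the underlying operations are exactly these.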


We’ll cover what branches represent and branching strategy in part 2 of this article, but at a very simplistic level:

  • Developers can work on their own branch of the code base for the particular task they are trying to achieve (fixing a bug, implementing a new feature etc.)
  • When an application is ready for release, some or all of the branches can be merged into a ‘main’ or ‘master’ branch, bringing together the changes from multiple developers into some coherent final product.

Git contains a number of tools and techniques for combining branches which we’ll explore in more detail in this guide.
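At its simplest, that branch-and-merge cycle looks like this on the command line (an illustrative sketch in a throwaway repository; the path, branch name and file contents are made up):

```shell
# Set up a throwaway repository for illustration
rm -rf /tmp/branch-demo
mkdir /tmp/branch-demo && cd /tmp/branch-demo
git init
git config user.email "dev@example.com"
git config user.name "Demo Dev"
echo "threshold = 0.5" > model.py
git add model.py && git commit -m "Initial commit"
git branch -M main

# Work on a task in its own branch...
git checkout -b feature/new-threshold
echo "threshold = 0.7" > model.py
git commit -am "Raise decision threshold"

# ...then merge it back into 'main' when it is ready for release
git checkout main
git merge --no-ff feature/new-threshold -m "Merge feature/new-threshold"
```

The `--no-ff` flag forces a merge commit, keeping a visible record of the feature branch in the history; a plain `git merge` would fast-forward here instead.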

A quick note on nomenclature: while ‘main’ and ‘master’ are used interchangeably in the context of Git branches, it is generally accepted that ‘master’ has an undesirable subtext and that ‘main’ should be preferred where possible. Most version control platforms have some option to set the default name, for example in Azure DevOps it can be found in the ‘Project settings’ / ‘All repositories settings’ menu:


Source / version control platforms

Most people in the data science world will have heard of GitHub, even if they’re not entirely sure what purpose it fulfils (perhaps beyond making useful tools available for download and use in your own project) or how Git and GitHub differ. Simplistically: where Git is a version control system, GitHub is perhaps better described as a version control platform. It is a place where teams can host their remote repositories, and it provides the APIs needed for using Git to push your local changes up to the remote.

It is worth adding, however, that the capabilities of platforms such as GitHub go far beyond simple hosting of repositories, for example enabling efficient code review by producing summarised comparison views between different versions of the codebase (known colloquially as ‘diffs’) and providing the ability to run automation scripts to build the application and execute tests whenever a user commits and pushes their changes.

And of course, GitHub is not the only such version control platform; there are many just like it. Some are cloud-based, providing the maximum possible degree of collaboration and accessibility, including cross-organisation or even public collaboration. Some organisations, on the other hand, prefer their source code to be managed within a much tighter trust zone and so will self-host their own version control platform.

Databricks provides support for all the following platforms:

  • GitHub, GitHub AE, and GitHub Enterprise Cloud
  • GitHub Enterprise Server*
  • Atlassian BitBucket Cloud
  • Atlassian BitBucket Server and Data Center*
  • GitLab and GitLab EE
  • GitLab Self-Managed*
  • Microsoft Azure DevOps (Azure Repos)
  • Microsoft Azure DevOps Server*
  • AWS CodeCommit

Those in the list suffixed with an asterisk are typically deployed on-premise.

Version control in Databricks

In 2021, Databricks made the ‘Repos’ feature generally available to all users of the platform. Longer-tenured Databricks users will perhaps recall what a great leap forward this was from what was possible before (a very clunky workflow for syncing individual notebooks with a Git repository).

The Repos feature has subsequently been enhanced and renamed ‘Git Folders’, with the biggest material change being the ability to store a Repo / Git Folder anywhere within the Databricks Workspace (rather than just at the /Repos folder). You may hear both terms used interchangeably by other Databricks users.

With Databricks Git Folders, Databricks does not itself become a version control platform; rather, this functionality positions the workspace as a home for ‘local’ versions of your repositories, with the remotes remaining in your chosen version control platform.

Users clone a remote repository into their workspace, add notebooks, scripts etc. or make changes to the version of the code stored there before committing these changes and pushing them back to the remote.

Working on Databricks code in your local development environment

Because of the way the Git Folders feature has been implemented in Databricks, there is also the possibility of cloning the remote repository to your local machine and working on it there using one of the patterns described in our docs section: ‘Use IDEs with Databricks’ (AWS | Azure | GCP). If changes to the code then need to be synchronised back into the workspace, this can be performed by pushing your local changes to the remote repository, then refreshing the version in the workspace by ‘pulling’ in the latest version of the code.
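This round trip can be simulated entirely with Git. In the hypothetical sketch below, two clones of the same bare ‘remote’ play the roles of a local IDE checkout (`/tmp/laptop`) and the workspace Git Folder (`/tmp/workspace`); in reality, the workspace side of the pull happens through the Databricks UI, and all the paths and names here are invented.

```shell
# A bare repo stands in for the hosted remote
rm -rf /tmp/sync-remote.git /tmp/laptop /tmp/workspace
git init --bare /tmp/sync-remote.git

# Two clones of the same remote: one plays the local IDE checkout,
# the other plays the Databricks Git Folder
git clone /tmp/sync-remote.git /tmp/laptop
git clone /tmp/sync-remote.git /tmp/workspace

# Edit and push from the 'laptop'...
cd /tmp/laptop
git config user.email "dev@example.com"
git config user.name "Demo Dev"
echo "learning_rate = 0.01" > config.py
git add config.py && git commit -m "Add training config"
git branch -M main
git push origin main

# ...then 'pull' inside the 'workspace' clone to pick up the change
cd /tmp/workspace
git pull origin main
```

After the pull, the ‘workspace’ copy reflects the change made on the ‘laptop’: the remote is the single point of synchronisation between the two environments.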

If that’s of interest to you, be sure to check out our guide on setting up your IDE, also in the MLOps Gym.

Getting started with Databricks Git Folders

Here we shall present a workflow covering the very simplest scenario: a single user starting a new project where they intend to work on their own and who wishes to persist their codebase in a remote repository in Azure DevOps.

Link accounts

The first step is to link your Azure DevOps and Databricks accounts using the ‘linked accounts’ UI inside the Databricks workspace:

  1. From the ‘User’ menu, select ‘User Settings’.
  2. Choose ‘Linked accounts’.
  3. Under ‘Git provider’, select ‘Azure DevOps Services (Azure Active Directory)’ and hit save.


Create the remote repository

1. Log into your Azure DevOps organisation by selecting the appropriate organisation from the list that’s shown here: https://aex.dev.azure.com/me

2. Create a new project.


3. From the new repo’s main page, find the URL for the repo under ‘Repos’ / ‘Files’ and copy it to the clipboard.


Clone the remote into your Databricks Workspace

1. Select either:

    • ‘Git Folder’ from the ‘Create’ drop-down menu within the workspace browser, or
    • ‘Repos’ in the ‘Workspace’ tab on the left-hand navigation bar, then click ‘Add Repo’.

2. Paste the URL into the form box marked ‘Git repository URL’ and hit ‘Create Repo’. You should now see the repo in your workspace:


Create notebooks or files in the local repository

1. If you enter the Git Folder / Repo, you should now be able to create objects inside the repository:


2. Create a notebook and write some simple code inside:

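Any notebook content will do at this stage. A minimal Python cell such as the following (purely illustrative; the function is made up) gives us a change worth committing:

```python
# A trivially simple function: its only job here is to give
# the repository a change worth committing and pushing
def add(a: int, b: int) -> int:
    """Return the sum of two integers."""
    return a + b

print(add(2, 3))  # prints 5
```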

Commit and push changes

1. From the Git Folder / Repo kebab menu, select ‘Git…’ to access Git operations.


2. This shows a summary of the changes made (referred to as a ‘diff’) and allows you to select the files to be included in the commit and to add a message describing it.


3. Clicking ‘Commit & Push’ will push the changes to the ‘master’ branch of the remote repository in Azure DevOps.


4. The Repos view in Azure DevOps should now reflect your pushed changes:


Note on the frequency of these actions:

  • Linking your Databricks and Azure DevOps (or other version control platform) accounts only needs to be done once per platform;
  • Creating and cloning repositories is only performed at the start of each project; and
  • Committing and pushing changes is performed when you want the remote repository to reflect your latest changes (in anticipation of, for example, a code review, automated testing, application release etc.)

Other version control platforms

The steps above will vary slightly depending on your organisation’s choice of version control platform, especially the parts concerning linking of accounts. All the options are documented in our public docs (AWS | Azure | GCP).

 

Summary 

The article discusses the evolution of version control in data science, emphasizing its importance for efficient teamwork, reproducibility, and maintaining an audit trail of project code, data, configuration, and execution environments. Initially overlooked in data science, version control, particularly using Git, is now essential for teams aiming to build robust data-centric applications. The article explains Git's role as the primary version control system and describes how it integrates with platforms like GitHub and Azure DevOps within Databricks. It also provides a practical guide for setting up and using Databricks Repos for version control, highlighting the steps for linking accounts, creating repositories, and managing code changes.

 

In part 2 of this article, we will discuss best practices when setting up version control.

 

Coming up next!

Next blog in this series: MLOps Gym - Version Control - Part 2: Best Practices