Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Git credentials for service principals running Jobs

camilo_s
Contributor

I know the documentation for setting up Git credentials for service principals: you have to use a PAT from your Git provider, which is inevitably tied to a user and has a lifecycle of its own.

Doesn't this kind of defeat the purpose of running a job as a service principal, or is it just me?

In principle, running a job as a service principal helps decouple it from a specific user and centralize the permissions the service principal needs to run its job.

But when a job fetches code from Git (best practice), there's inevitably still a coupling to a user at the Git provider level. Not only do you face the risk of the user leaving the org and all workflows failing, but you also have to manage the PAT's lifecycle and rotation.

This becomes a nightmare if you want a moderately granular assignment of service principals to workflows (e.g. if you run a data mesh where you'd like each data product to have its own service principal to run the data product's workflows).

Is support for OIDC authentication anywhere on Databricks' roadmap? Several Git providers (e.g. GitHub, Azure DevOps) are capable of workload identity federation, so I wonder why Databricks hasn't caught up and implemented OIDC to leverage this.

It would greatly improve the platform experience of Databricks.

7 REPLIES

delonb2
New Contributor III

 

I recommend not fetching from a Git provider directly to run code during workflows. Instead, have a task that updates a Git folder within your workspace during the job (article with more details below). That way you can use Databricks to manage permissions for users and service principals and achieve granular isolation that stays within the platform and is easily traceable. For enterprises with on-prem Git data, this also avoids jobs failing because the Git proxy server is down. The resource below sets up a great solution, but a simpler one is to have the job update the Git folder, with retries, whenever it is scheduled.

CI/CD techniques with Git and Databricks Git folders (Repos) | Databricks on AWS
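To make the "update the Git folder with retries" idea concrete, here is a minimal sketch of a retry helper with exponential backoff. The helper name `call_with_retries` and its parameters are my own (hypothetical), and the `action` you pass in is whatever call updates your Git folder; nothing here is taken from the linked article.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying with exponential backoff if it raises.

    `action` is a hypothetical zero-argument callable, for example the
    call that pulls the latest commit into a workspace Git folder.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            # Give up and surface the error after the last attempt.
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the first task of a job, `action` could wrap the Git folder update (for instance a Databricks SDK call along the lines of `w.repos.update(repo_id=..., branch="main")`, assuming the SDK is installed on the cluster), so a short outage of the Git proxy does not fail the whole run.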

Update: Forgot to add this originally, but these are the non-PAT auth solutions: Configure Git credentials & connect a remote repo to Databricks | Databricks on AWS

We do support both job code source options in our platform: fetch from Git provider and fetch from Git folder.

Our release process is driven by tags though (like described here), which is why we prefer to use direct fetch from the Git provider: you can pin the Git tag in a released workflow's definition and it transparently shows everywhere.

In principle, this process also works for Git folder-sourced workflows: we update Git folders from a Git tag as part of our CI/CD process. Unfortunately, this has the major shortcoming that the Databricks UI doesn't display the Git tag's name next to the repository, just uninformative detached... (see screenshot attached). This is why we tend to prefer Git-sourced code.

Screenshot 2024-06-18 at 09.45.53.png

In my opinion, if the REST API accepts a tag parameter, it should honor it by propagating it through to the UI 😞 I guess this is currently impossible because the Get a repo endpoint doesn't return a tag attribute in its response that the UI could read the tag from (FWIW, I don't think this would be a major effort to implement).
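For readers who haven't used it, this is roughly what updating a Git folder to a tag looks like from a CI/CD step. It is a sketch, not our actual pipeline code: the function name is hypothetical, the host, token, repo ID, and tag are placeholders, and it only builds the request against the Repos API's update endpoint (`PATCH /api/2.0/repos/{repo_id}`) without sending it.

```python
import json
import urllib.request


def build_tag_checkout_request(
    host: str, token: str, repo_id: int, tag: str
) -> urllib.request.Request:
    """Build (but do not send) a request that checks out `tag` in a Git folder.

    All argument values are placeholders supplied by the caller.
    """
    body = json.dumps({"tag": tag}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the update. The tag goes in through the request body, which is why it is frustrating that it cannot be read back for display afterwards.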

I love Databricks but sometimes have the opinion it should divert a bit more resources to improve the developer experience, so developers can easily implement their current good practices rather than having to force-mold them around Databricks' limitations.

On a side note: existing non-PAT solutions only work for users and not for service principals.

-werners-
Esteemed Contributor III

@camilo_s wrote:

I love Databricks but sometimes have the opinion it should divert a bit more resources to improve the developer experience, so developers can easily implement their current good practices rather than having to force-mold them around Databricks' limitations.


Agree!

nicole_lu_PM
Contributor III

Hi Camilo,

Thank you for the thoughtful feedback. Our implementation does not currently limit what type of Git provider identity the user chooses to use with the Databricks SP. Git providers differ in their support and recommendations for service accounts.

  • Azure DevOps: We recently added documentation and published a blog about connecting to Azure DevOps using Entra ID credentials for service principals. This should avoid storing a user's PAT if you connect to Azure DevOps. https://docs.databricks.com/en/dev-tools/ci-cd/use-ms-entra-sp-with-devops.html
  • GitHub: does not differentiate between accounts used by humans and machine users. Users can create machine user accounts to automate continuous integration (CI) workflows. We do not currently offer OAuth connectivity for SPs via GitHub Apps, but it has been discussed internally.
  • GitLab: supports service accounts that are distinct from user accounts, but does not support OAuth authentication for service accounts. We currently do not support OAuth connectivity for GitLab.
  • Bitbucket: does not support service accounts. Bitbucket recommends using a regular account with an email address as a "service account", using SSH access keys, or using Repository Access Tokens. We currently do not support OAuth connectivity for Bitbucket.
  • AWS CodeCommit: does not differentiate between humans and applications; both authenticate as IAM users. We currently do not support OAuth connectivity for CodeCommit and it's being deprecated.

Which Git provider does your organization use? 

 

Thanks for your reply @nicole_lu_PM,

We use Azure DevOps and Microsoft Entra ID service principals for automation. In our CI/CD we're able to use service principal authentication via the Azure CLI to interact with Databricks workspaces via the Databricks-SDK/CLI and we have no pain points with this process.

If I understand the documentation and blog post correctly, the authentication there refers to the connection Databricks Git folder <-> Git provider via a Git credential for an Entra ID SP. We currently do this essentially as described in those documents, with one difference: you don't really need to create an additional client secret for the Entra ID SP in the Databricks account because the Entra ID client secret generated in Step 1 of the documentation suffices (both the SDK and CLI can pick the authentication context of az login and request the Entra ID tokens in the background).

Our pain point, and where I believe Databricks' platform ergonomics are still immature, is the connection Job running as service principal <-> Git provider when you use version-controlled source code in a Databricks job.

  • If you trigger job runs via API request, you can add a step beforehand that fetches a fresh Entra ID token and updates the SP's Git credential with it just in time for the job run (which will check out the code from the upstream repo).
  • But what if you'd like to let Databricks orchestrate your job runs by triggering them on a schedule? There's no way to manually add a token refresh before the job run there, and even if it were possible, the real solution would be for Databricks to support machine-to-machine OAuth from Databricks to Git providers (I'm well aware that m2m is supported towards Databricks).
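The API-triggered variant in the first bullet can be sketched as two requests sent in order: update the SP's Git credential, then trigger the run. This is a minimal illustration rather than our production code. The function names and `fetch_git_token` callable are hypothetical, the `azureDevOpsServicesAad` provider value is my assumption based on the Databricks documentation for Entra ID with Azure DevOps, and the requests are only built, not sent.

```python
import json
import urllib.request
from typing import Callable, List


def _json_request(method: str, url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON request without sending it."""
    return urllib.request.Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        method=method,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


def build_refresh_and_run_requests(
    host: str,
    databricks_token: str,
    credential_id: int,
    job_id: int,
    git_username: str,
    fetch_git_token: Callable[[], str],
) -> List[urllib.request.Request]:
    """Return [update Git credential, run job], to be sent in that order.

    `fetch_git_token` is a hypothetical callable returning a fresh Entra ID
    access token for Azure DevOps. `databricks_token` must belong to the
    service principal itself, since Git credentials are per identity.
    """
    base = host.rstrip("/")
    update_credential = _json_request(
        "PATCH",
        f"{base}/api/2.0/git-credentials/{credential_id}",
        databricks_token,
        {
            "git_provider": "azureDevOpsServicesAad",  # assumed provider value
            "git_username": git_username,
            "personal_access_token": fetch_git_token(),
        },
    )
    run_now = _json_request(
        "POST", f"{base}/api/2.1/jobs/run-now", databricks_token, {"job_id": job_id}
    )
    return [update_credential, run_now]
```

Sending each request with `urllib.request.urlopen`, in order, reproduces the just-in-time refresh. A scheduled trigger has no place to hook in the first request, which is exactly the gap described in the second bullet.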

Identity providers do support this use case as well (Microsoft Entra ID with Workload Identity Federation, GitHub with OAuth Apps), and I'll be happy when Databricks supports it and we as platform engineers can get rid of the workarounds we need to make up for it.

Thank you for your great feedback camilo_s. We acknowledge that the Entra ID service principal Git credential user journey is cumbersome, especially when you try to use it with a Git job. I agree that the best approach is for the product to build a non-PAT, OAuth-based integration that works for SPs.

For now, we are working internally to produce a sample for getting an Entra ID SP Git credential to work in a Git job, like this:

* Make the Entra flow the first task in a job, and make the original job the second task

* In the Entra flow, execute the following code: however, instead of using read for inputs, store all the inputs in a secret scope and use them to get a fresh Entra ID token for databricks git-credentials update
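Until that sample is published, the token-fetching half of the first task might look like the sketch below. It is not the referenced code: the function name and the secret key names (`tenant-id`, `client-id`, `client-secret`) are hypothetical, `get_secret` stands in for a secret-scope lookup, and the scope constant is the well-known Azure DevOps resource ID, used here as an assumption. The request is built but not sent.

```python
import urllib.parse
import urllib.request
from typing import Callable

# Assumed: the Azure DevOps resource ID, requested with the ".default" scope.
AZURE_DEVOPS_SCOPE = "499b84ac-1321-427f-aa17-267ca6975798/.default"


def build_entra_token_request(get_secret: Callable[[str], str]) -> urllib.request.Request:
    """Build a client-credentials token request from secret-scope values.

    `get_secret` is a hypothetical lookup; the key names are illustrative.
    """
    tenant_id = get_secret("tenant-id")
    form = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": get_secret("client-id"),
            "client_secret": get_secret("client-secret"),
            "scope": AZURE_DEVOPS_SCOPE,
        }
    ).encode("ascii")
    return urllib.request.Request(
        url=f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token",
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

In a notebook task, `get_secret` could be something like `lambda key: dbutils.secrets.get(scope="my-scope", key=key)` (the scope name is a placeholder). The `access_token` field of the JSON response is the value to pass on to `databricks git-credentials update`.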

camilo_s
Contributor

That would be a valid workaround (with the caveat that if a job's tasks run long enough, the token may expire). I also agree that this authentication flow should ideally be provided by Databricks as a feature, given that many customers likely face the same challenge when relying on jobs that run as service principals while sourcing code directly from a Git provider.

Thanks are due to you @nicole_lu_PM for engaging in the discussion and honestly caring for customers!
