
Git credentials for service principals running Jobs

camilo_s
Contributor

I know the documentation for setting up Git credentials for service principals: you have to use a PAT from your Git provider, which is inevitably tied to a user and has a lifecycle of its own.

Doesn't this kind of defeat the purpose of running a job as a service principal, or is it just me?

In principle, running a job as a service principal helps decouple it from a specific user and centralizes the permissions the service principal needs to run its job.

But when a job fetches code from Git (best practice), there is inevitably still a coupling to a user at the Git provider level. You not only risk all workflows failing when that user leaves the org, you also have to manage each PAT's lifecycle and rotation.
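
To make the rotation burden concrete, here is a rough sketch of what replacing one service principal's stored PAT could look like. It assumes the Git credentials REST API (`PATCH /api/2.0/git-credentials/{credential_id}`) is called with a Databricks token belonging to the service principal itself; the function name and every argument value are hypothetical, and the new PAT still has to be minted by a user at the Git provider first.

```python
import json
import urllib.request


def build_pat_rotation_request(host, sp_token, credential_id,
                               git_provider, git_username, new_pat):
    """Build (but do not send) the request that swaps in a new Git PAT.

    Assumption: sp_token is a Databricks token for the service principal,
    so the credential being updated is the service principal's own.
    """
    body = {
        "git_provider": git_provider,
        "git_username": git_username,
        "personal_access_token": new_pat,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/git-credentials/{credential_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {sp_token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

# Sending it is a single call, repeated per service principal on every rotation:
#   urllib.request.urlopen(build_pat_rotation_request(...))
```

With one service principal per data product, this has to be repeated for each of them every time a PAT expires.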

This becomes a nightmare if you want a moderately granular assignment of service principals to workflows (e.g. if you run a data mesh where you'd like each data product to have its own service principal to run the data product's workflows).

Is support for OIDC authentication anywhere on Databricks' roadmap? Several Git providers (e.g. GitHub, Azure DevOps) are capable of workload identity federation, so I wonder why Databricks hasn't caught up and implemented OIDC to leverage this.

It would greatly improve the platform experience of Databricks.

3 REPLIES

delonb2
New Contributor III

 

I recommend not fetching from a Git provider directly when the workflow runs. Instead, have a task that updates a Git folder within your workspace during the job (article with more details below). That way you can use Databricks to manage permissions for users and service principals and achieve granular isolation that stays within the platform and is easily traceable. For enterprises with on-prem Git servers, this also avoids jobs failing because the Git proxy server is down. The resource below sets up a great solution, but a simpler one is to have the job update the Git folder, with retries, whenever it is scheduled.

CI/CD techniques with Git and Databricks Git folders (Repos) | Databricks on AWS
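
The simpler option (a job task that updates the Git folder, with retries) could look roughly like this. It is only a sketch, assuming the Repos REST API (`PATCH /api/2.0/repos/{repo_id}` with a `branch` or `tag` field); the function names, the retry policy, and the backoff values are made up for illustration.

```python
import json
import time
import urllib.error
import urllib.request


def build_update_request(host, token, repo_id, *, branch=None, tag=None):
    """Build the PATCH request that checks out a branch or tag in a Git folder."""
    if (branch is None) == (tag is None):
        raise ValueError("pass exactly one of branch or tag")
    body = {"branch": branch} if branch is not None else {"tag": tag}
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


def update_git_folder(request, *, attempts=3, backoff_seconds=5.0,
                      send=urllib.request.urlopen, sleep=time.sleep):
    """Send the request, retrying network errors, HTTP 429 and HTTP 5xx."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with send(request) as response:
                return json.loads(response.read() or b"{}")
        except urllib.error.HTTPError as error:
            # Other 4xx responses (bad auth, unknown repo) will not fix themselves.
            if error.code < 500 and error.code != 429:
                raise
            last_error = error
        except urllib.error.URLError as error:
            # For example, the Git proxy or the workspace is unreachable.
            last_error = error
        if attempt < attempts:
            sleep(backoff_seconds * attempt)  # linear backoff: 5s, 10s, ...
    raise RuntimeError(
        f"Git folder update failed after {attempts} attempts"
    ) from last_error
```

As the first task of the job you would call something like `update_git_folder(build_update_request(host, token, repo_id, branch="main"))`, where `host`, `token` and `repo_id` are placeholders for your workspace URL, a token for the identity running the job, and the Git folder's ID.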

Update: I forgot to add this originally, but these are the non-PAT auth solutions: Configure Git credentials & connect a remote repo to Databricks | Databricks on AWS

camilo_s
Contributor

We do support both job code source options in our platform: fetch from Git provider and fetch from Git folder.

Our release process is driven by tags though (like described here), which is why we prefer direct fetch from the Git provider: you can pin the Git tag in a released workflow's definition, and it shows up transparently everywhere.
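
For illustration, pinning a tag in a workflow definition could look roughly like the settings below. This is a sketch that assumes the Jobs API 2.1 `git_source` block (`git_url`, `git_provider`, `git_tag`) and `run_as`; the helper name, job name, repository URL, tag, notebook path and application ID are all made up, and compute configuration is left out.

```python
def tagged_job_settings(name, git_url, git_provider, git_tag,
                        notebook_path, service_principal_app_id=None):
    """Return job settings whose code is fetched from a pinned Git tag."""
    settings = {
        "name": name,
        "git_source": {
            "git_url": git_url,
            "git_provider": git_provider,
            "git_tag": git_tag,  # the released version, visible in the job definition
        },
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {
                    "notebook_path": notebook_path,  # relative to the repo root
                    "source": "GIT",
                },
                # Compute settings omitted; a real job needs them.
            }
        ],
    }
    if service_principal_app_id is not None:
        settings["run_as"] = {"service_principal_name": service_principal_app_id}
    return settings


# Hypothetical release of a data product at tag v1.4.0.
release = tagged_job_settings(
    name="orders-data-product",
    git_url="https://github.com/example-org/orders-data-product",
    git_provider="gitHub",
    git_tag="v1.4.0",
    notebook_path="notebooks/run",
    service_principal_app_id="00000000-0000-0000-0000-000000000000",
)
```

Releasing a new version then comes down to changing `git_tag` in the job's settings.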

In principle, this process also works for Git folder-sourced workflows: we update Git folders from a Git tag as part of our CI/CD process. Unfortunately, this has the major shortcoming that the Databricks UI doesn't display the Git tag's name next to the repository, just an uninformative detached... (see attached screenshot). This is why we tend to prefer Git-sourced code.

Screenshot 2024-06-18 at 09.45.53.png

In my opinion, if the REST API accepts a tag parameter, it should honor it by propagating it through to the UI 😞 I guess this is currently impossible because the Get a repo endpoint doesn't return a tag attribute in its response that the UI could read the tag from (FWIW, I don't think this would be a major milestone to implement).

I love Databricks but sometimes have the opinion it should divert a bit more resources to improve the developer experience, so developers can easily implement their current good practices rather than having to force-mold them around Databricks' limitations.

On a side note: existing non-PAT solutions only work for users and not for service principals.

-werners-
Esteemed Contributor III

@camilo_s wrote:

I love Databricks but sometimes have the opinion it should divert a bit more resources to improve the developer experience, so developers can easily implement their current good practices rather than having to force-mold them around Databricks' limitations.


Agree!
