06-13-2024 09:07 AM
I know the documentation for setting up Git credentials for service principals: you have to use a PAT from your Git provider, which is inevitably tied to a user and has a lifecycle of its own.
Doesn't this kind of defeat the purpose of running a job as a service principal, or is it just me?
In principle, running a job as a service principal would help decouple it from a specific user and also centralize the permissions the service principal may require to run its job.
But in the case where a job fetches code from Git (best practice), there's inevitably still a coupling to a user at the Git provider level. You not only face the risk of the user leaving the org and all workflows failing, but you also have to manage the PATs' lifecycle and rotation.
This becomes a nightmare if you want a moderately granular assignment of service principals to workflows (e.g. if you run a data mesh where you'd like each data product to have its own service principal to run the data product's workflows).
Are there any plans on Databricks' roadmap to add support for OIDC authentication? Several Git providers (e.g. GitHub, Azure DevOps) are capable of workload identity federation, so I wonder why Databricks hasn't caught up and implemented OIDC to leverage this.
It would greatly improve the platform experience of Databricks.
06-14-2024 10:03 PM - edited 06-14-2024 10:05 PM
I recommend not fetching from a Git provider directly to run code during workflows and instead having a task that updates a Git folder within your workspace during the job (article with more details below). That way you can use Databricks to manage permissions for users and service principals and achieve granular isolation that is all within the platform and is easily traceable. For enterprises with on-prem Git data, this also avoids jobs failing due to the Git proxy server being down. The resource below describes a great solution, but a simpler one is to have the job update the Git folder, with retries, whenever it is scheduled.
CI/CD techniques with Git and Databricks Git folders (Repos) | Databricks on AWS
Update: Forgot to add originally, but these are the non-PAT auth solutions: Configure Git credentials & connect a remote repo to Databricks | Databricks on AWS
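To illustrate the simpler variant mentioned above (a job task that updates the Git folder with retries), here is a minimal Python sketch. It assumes the first task of the job runs as the service principal and that the databricks-sdk package is available; the helper names, the retry count, and the delay are hypothetical choices, not something prescribed by Databricks.

```python
import time
from typing import Callable


def update_with_retries(update: Callable[[], None], attempts: int = 3, delay_s: float = 5.0) -> int:
    """Call `update` until it succeeds and return the number of attempts used.

    Re-raises the last error if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            update()
            return attempt
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_s)  # fixed pause between attempts
    raise ValueError("attempts must be at least 1")


def update_git_folder(repo_id: int, branch: str) -> None:
    """Check out the latest state of `branch` in the workspace Git folder.

    Assumption: databricks-sdk is installed and authentication is picked up
    from the job's environment.
    """
    from databricks.sdk import WorkspaceClient  # imported lazily on purpose

    WorkspaceClient().repos.update(repo_id=repo_id, branch=branch)
```

In the first task of the job this could be called as `update_with_retries(lambda: update_git_folder(repo_id=123, branch="main"))`, where the repo ID and branch are placeholders.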
06-18-2024 01:08 AM
We do support both job code source options in our platform: fetch from Git provider and fetch from Git folder.
Our release process is driven by tags though (as described here), which is why we prefer to use direct fetch from the Git provider: you can pin the Git tag in a released workflow's definition and it transparently shows everywhere.
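For readers unfamiliar with tag pinning, the following Python sketch shows roughly what such a job definition fragment looks like. The `git_source` block with `git_url`, `git_provider`, and `git_tag` follows the Jobs API; the helper name, the job name, the repository URL, and the tag value are made up for illustration.

```python
from typing import Dict


def git_source_for_tag(git_url: str, git_tag: str, git_provider: str = "azureDevOpsServices") -> Dict[str, str]:
    """Build the git_source block of a job definition pinned to one Git tag."""
    return {"git_url": git_url, "git_provider": git_provider, "git_tag": git_tag}


# Hypothetical job settings; only the git_source part matters here.
job_settings = {
    "name": "data-product-a-release",
    "git_source": git_source_for_tag(
        "https://dev.azure.com/example-org/example-project/_git/example-repo",
        "v1.2.0",
    ),
}
```

Because the tag is part of the job definition itself, it appears wherever the definition is shown, which is the transparency described above.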
In principle, this process also works for Git folder-sourced workflows: we update Git folders from a Git tag as part of our CI/CD process. Unfortunately, this has the major shortcoming that the Databricks UI doesn't display the Git tag's name next to the repository, just uninformative detached... (see screenshot attached). This is why we tend to prefer Git-sourced code.
In my opinion, if the REST API admits a tag parameter, it should honor it by propagating it through to the UI. I guess this is currently impossible because the Get a repo endpoint doesn't provide a tag attribute in its response that the UI could get the tag from (I don't think it should be a major milestone to implement, FWIW).
I love Databricks but sometimes have the opinion it should divert a bit more resources to improve the developer experience, so developers can easily implement their current good practices rather than having to force-mold them around Databricks' limitations.
On a side note: existing non-PAT solutions only work for users and not for service principals.
06-18-2024 03:11 AM
@camilo_s wrote: I love Databricks but sometimes have the opinion it should divert a bit more resources to improve the developer experience, so developers can easily implement their current good practices rather than having to force-mold them around Databricks' limitations.
Agree!
3 weeks ago
Hi Camilo,
Thank you for the thoughtful feedback. Our implementation does not currently limit what type of Git provider identity the user chooses to use with the Databricks SP. Git providers have different levels of support and different recommendations for service accounts.
Which Git provider does your organization use?
2 weeks ago
Thanks for your reply @nicole_lu_PM,
We use Azure DevOps and Microsoft Entra ID service principals for automation. In our CI/CD we're able to use service principal authentication via the Azure CLI to interact with Databricks workspaces via the Databricks-SDK/CLI and we have no pain points with this process.
If I understand the documentation and blog post correctly, the authentication there refers to the connection Databricks Git folder <-> Git provider via a Git credential for an Entra ID SP. We currently do this essentially as described in those documents, with one difference: you don't really need to create an additional client secret for the Entra ID SP in the Databricks account because the Entra ID client secret generated in Step 1 of the documentation suffices (both the SDK and CLI can pick the authentication context of az login and request the Entra ID tokens in the background).
Our pain point, and where I believe Databricks' platform ergonomics are still immature, is the connection Job running as service principal <-> Git provider when you use version-controlled source code in a Databricks job.
Identity providers do support such a use case as well (Microsoft Entra ID with Workload Identity Federation, GitHub with OAuth Apps) and I'll be happy when Databricks supports it and we as platform engineers can get rid of the workarounds we need to make up for it.
a week ago
Thank you for your great feedback camilo_s. We acknowledge that the EntraID service principal git cred user journey is cumbersome, especially when you try to use it with a git job. I agree that the best approach is for the product to build a non-PAT, OAuth-based integration that works for SP.
For now, we are working internally to produce a sample for getting an EntraId SP git cred to work in a git job like this:
* Make the Entra flow the first task in a job, and make the original job the second task
* In the Entra flow, execute the following code; however, instead of using read for inputs, store all the inputs in a secret scope and use them to get a fresh EntraId token for databricks git-credentials update
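The two steps above could be sketched in Python roughly as follows. This is not the internal sample mentioned in the post, only an illustration under stated assumptions: the token is requested with the client-credentials grant against the Entra ID v2.0 token endpoint using the Azure DevOps application ID as scope, and the Git credential is updated through the `/api/2.0/git-credentials/{credential_id}` REST endpoint with the `azureDevOpsServicesAad` provider. All function names and parameter names are hypothetical, and the inputs are expected to come from a secret scope.

```python
import json
import urllib.parse
import urllib.request

# Application ID of Azure DevOps, used as the scope of the Entra ID token.
AZURE_DEVOPS_SCOPE = "499b84ac-1321-427f-aa17-267ca6975798/.default"


def entra_token_request(tenant_id: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the client-credentials token request for the Entra ID SP."""
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": AZURE_DEVOPS_SCOPE,
    }).encode()
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    return urllib.request.Request(url, data=body, method="POST")


def git_credential_update_request(host: str, credential_id: int, databricks_token: str,
                                  entra_token: str) -> urllib.request.Request:
    """Build the request that stores the fresh Entra ID token as the Git credential."""
    # Assumption: provider name and payload shape as in the Git credentials REST API.
    payload = {"git_provider": "azureDevOpsServicesAad", "personal_access_token": entra_token}
    return urllib.request.Request(
        f"{host}/api/2.0/git-credentials/{credential_id}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {databricks_token}", "Content-Type": "application/json"},
        method="PATCH",
    )


def refresh_git_credential(host: str, credential_id: int, databricks_token: str,
                           tenant_id: str, client_id: str, client_secret: str) -> None:
    """First task of the job: fetch a fresh token, then update the Git credential."""
    with urllib.request.urlopen(entra_token_request(tenant_id, client_id, client_secret)) as resp:
        entra_token = json.load(resp)["access_token"]
    with urllib.request.urlopen(
        git_credential_update_request(host, credential_id, databricks_token, entra_token)
    ) as resp:
        resp.read()
```

In a notebook task, each argument would be read with something like `dbutils.secrets.get(scope="git-automation", key="client-secret")`, where the scope and key names are placeholders.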
a week ago
That would be a valid workaround (with the caveat that if a job's tasks run long enough, the token may expire). I also agree that this authentication flow should ideally be provided by Databricks as a feature, given that many customers likely face the same challenge when relying on jobs that run as service principals while sourcing code from a Git provider directly.
Thanks are due to you @nicole_lu_PM for engaging in the discussion and honestly caring for customers!