Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Running jobs as service principal, while pulling code from Azure DevOps

LuukDSL
New Contributor III

In our Dataplatform, our jobs are defined in a dataplatform_jobs.yml within a Databricks Asset Bundle, and then pushed to Databricks via an Azure Devops Pipeline (Azure Devops is where our codebase resides). Currently, this results in workflows looking like this, where they're created by the Dataplatform Service Principal, but are run as the username of a specific colleague: 

[screenshot: LuukDSL_0-1751983798686.png - job created by the Dataplatform Service Principal, but run as a colleague's user account]

We'd like to change this so that "Run as" is also the Service Principal. That would make maintenance easier, and we wouldn't run into trouble if, for example, this colleague leaves the team. However, our workflows are connected to our DevOps repo and run on the latest version of our dev/test/acc/prd branch. As a user this works fine, because the PAT of that specific user is used for authentication. If we change it to sp-dataplatform, we run into authentication issues.
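What we're aiming for is roughly this at the top level of our databricks.yml (just a sketch; the application ID below is a placeholder for sp-dataplatform):

run_as:
  service_principal_name: "00000000-0000-0000-0000-000000000000"  # placeholder: application ID of sp-dataplatform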

We could add a PAT for sp-dataplatform manually, but then this is still tied to a specific user account. This doesn't really solve the issue. 

We also tried the Azure DevOps Services (Azure Active Directory) option for Git integration on the service principal, but I believe this is only used to pull Databricks repos to DevOps, rather than the other way around?

There are a lot of links and threads related to this, such as:
https://community.databricks.com/t5/data-engineering/use-azure-service-principal-to-access-azure-dev...
https://learn.microsoft.com/en-us/azure/databricks/repos/automate-with-ms-entra
https://learn.microsoft.com/en-us/azure/databricks/jobs/how-to/run-jobs-with-service-principals
https://community.databricks.com/t5/data-engineering/run-task-as-service-principal-with-code-in-azur... 

I've experimented with these options as mentioned, but I think they all serve a slightly different use case. Some colleagues who worked on different projects also didn't have a 100% satisfactory solution for this. Are we missing something; is there a way in which we can configure this to work?

Thanks in advance!

14 REPLIES

ilir_nuredini
Honored Contributor

Hello @LuukDSL ,

Could you share a snippet of your CI/CD YAML file so we can give more specific advice?
I’ve connected Azure DevOps to Databricks using ARM credentials, and that set the job's "Run as" user to the service principal; no extra steps were required.

Once you share the snippet, we can suggest the next steps.

Best, Ilir

Hi @ilir_nuredini,

Thanks for your response. Does your pipeline also run on a git source of your repository?

In our CD Pipeline.yml for Devops (in which we use Terraform) we have this stage for the asset bundles:
- stage: DeployDatabricksWorkflows
  displayName: "Deploy Databricks Workflows with Asset Bundles"
  condition: not(failed('ApplyTerraformEnv'))
  variables:
    workspace_url: $[ stageDependencies.OutputTerraformEnv.outputTerraformJob.outputs['readOutputTerraformTask.workspace_url'] ]
    serverless_warehouse_id: $[ stageDependencies.OutputTerraformEnv.outputTerraformJob.outputs['readOutputTerraformTask.serverless_warehouse_id'] ]
  jobs:
    - job: "DeployAssetBundle"
      steps:
        - script: "curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh"
          displayName: "Install Databricks CLI"
          workingDirectory: .

        - script: 'databricks bundle deploy --var="serverless_warehouse_id=$(serverless_warehouse_id)"'
          displayName: "Deploy Asset Bundle"
          workingDirectory: bundle
          env:
            DATABRICKS_HOST: $(workspace_url)
            DATABRICKS_CLIENT_ID: $(client_id)
            DATABRICKS_CLIENT_SECRET: $(auth_secret)
            DATABRICKS_BUNDLE_ENV: $(target_branch)

 

In our dataplatform_jobs.yml, workflows and tasks are configured with these settings:

source: GIT
run_as: <email of colleague>
git_source:
  git_url: https://our-organisation@dev.azure.com/our-organisation/Dataplatform/_git/dataplatform
  git_provider: azureDevOpsServices

Hello @LuukDSL ,

This is how I am connecting to Databricks:

env:
  ARM_TENANT_ID: $(AZURE_SP_TENANT_ID)
  ARM_CLIENT_ID: $(AZURE_SP_APPLICATION_ID)
  ARM_CLIENT_SECRET: $(AZURE_SP_CLIENT_SECRET)

So I am using the SP credentials to connect, and the jobs get the SP assigned as both owner and "Run as" user.
And yes, it runs on the Git source of my repository.

Best, Ilir

Interesting! Is this also in your CD file where you use databricks bundle deploy? It looks similar to my env part, although you're using AZURE_SP_ variables. I suppose they're also used to make the connection to Databricks?

Also, did you specify somewhere how your SP can connect to the repo, i.e. via Settings -> Identity and access (workspace admin) -> Service principals (manage), and then the Git integration tab of your SP?

Hello @LuukDSL ,

That’s right, I’m using the Azure SP variables to connect to Databricks.
However, the part where the SP connects to the repo happens outside of Databricks (i.e. in Azure DevOps).
You don’t need to set up any Git integration for the SP, because once you push your code through DABs, it resides within Databricks, so no further connection to Git is needed.
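For illustration, a task in my jobs YAML looks roughly like this (a sketch with example names; there is no git_source block, and the notebook path points at the file deployed with the bundle):

tasks:
  - task_key: example_task
    notebook_task:
      notebook_path: ../src/example_notebook.py  # example path, relative to the bundle root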

Best, Ilir

Hi @ilir_nuredini,

[...] because once you push your code through DABs, it resides within Databricks, no further connection to Git is needed.

I think this is what we might be doing differently. At the bottom of my first reply, I specified how the tasks in our workflows are configured via the dataplatform_jobs.yml file. Because we specify the git-source there, it leads to this configuration in the Jobs UI:

[screenshot: LuukDSL_0-1752758187672.png - job task configured with a Git source (provider: Azure DevOps Services)]

So, on every workflow run the code is pulled dynamically. I suppose you're using the DAB in another way, where the whole repo is pushed to a repo in the Databricks Workspace?

That is right, the whole repo (bundle file structure) is pushed to the Databricks Workspace.

Can you try generating an OAuth Databricks token for this SP and then passing that token to your databricks bundle deploy step in the env variables section, instead of the client ID and secret?

 
 
 
Add the following to your parameters (or your preferred way of passing deployment values):

  - name: sp_app_id_dev
    displayName: Service Principal App ID DEV (for oauth token)
    type: string
    default: ""

  - name: sp_app_id_acc
    displayName: Service Principal App ID ACC (for oauth token)
    type: string
    default: ""

  - name: sp_app_id_prd
    displayName: Service Principal App ID PRD (for oauth token)
    type: string
    default: ""
##############################################################
Add this job as the first job:
######################
- job: oauth_bearer_token_sp
  steps:
    - script: |
        wget https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux32 -O $(Build.Repository.LocalPath)/jq
        chmod +x $(Build.Repository.LocalPath)/jq
      displayName: Install jq
      condition: succeeded()
    - script: |
        # Pick the SP credentials and workspace URL for the target environment
        if [[ "${{ variables.env }}" == "dev" ]]
        then
          CLIENT_ID=${{ parameters.sp_app_id_dev }}
          CLIENT_SECRET=$SP_SECRET_DEV
          DATABRICKS_WORKSPACE_URL=${{ parameters.databricks_wrkspc_url_dev }}
        elif [[ "${{ variables.env }}" == "acc" ]]
        then
          CLIENT_ID=${{ parameters.sp_app_id_acc }}
          CLIENT_SECRET=$SP_SECRET_ACC
          DATABRICKS_WORKSPACE_URL=${{ parameters.databricks_wrkspc_url_acc }}
        else
          CLIENT_ID=${{ parameters.sp_app_id_prd }}
          CLIENT_SECRET=$SP_SECRET_PRD
          DATABRICKS_WORKSPACE_URL=${{ parameters.databricks_wrkspc_url_prd }}
        fi
        DATABRICKS_URL="$DATABRICKS_WORKSPACE_URL/api/2.0/token/create"

        # Request a Microsoft Entra ID access token for the Databricks resource via the client-credentials flow
        access_token_val=$(curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
                       https://login.microsoftonline.com/af73baa8-f594-4eb2-a39d-93e96cad61fc/oauth2/v2.0/token \
                       -d "client_id=$CLIENT_ID" \
                       -d 'grant_type=client_credentials' \
                       -d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \
                       -d "client_secret=$CLIENT_SECRET")
        access_token=$(jq -r '.access_token' <<< "$access_token_val")

        # Exchange the Entra ID token for a Databricks workspace token owned by the service principal
        api_response=$(curl -X POST $DATABRICKS_URL \
                      -H "Authorization: Bearer $access_token" \
                      -H "X-Databricks-Azure-SP-Management-Token: $access_token" \
                      -d '{"comment": "pipeline token"}')
        DATABRICKS_NEW_TOKEN=$(jq -r '.token_value' <<< "$api_response")
        if [ -z "${DATABRICKS_NEW_TOKEN}" ]
        then
          echo "Token could not be created"
          exit 1
        else
          echo "Successfully created a Databricks Token"
          echo "##vso[task.setvariable variable=DATABRICKS_TOKEN;isOutput=true]$DATABRICKS_NEW_TOKEN"
          echo "##vso[task.setvariable variable=ACCESS_TOKEN;isOutput=true]$access_token"
        fi
      displayName: 'Create oauth token'
      name: oauth
      condition: succeeded()
 
####################
Pass this DATABRICKS_TOKEN to the next stage or job as a variable:
 
 
variables:
  DATABRICKS_TOKEN: $[ dependencies.oauth_bearer_token_sp.outputs['oauth.DATABRICKS_TOKEN'] ]
 
###############
 
Use this DATABRICKS_TOKEN as an env variable for the asset bundle deploy script.
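For example, the deploy step shared earlier would then look roughly like this (a sketch reusing the variable names from your pipeline):

- script: 'databricks bundle deploy --var="serverless_warehouse_id=$(serverless_warehouse_id)"'
  displayName: "Deploy Asset Bundle"
  workingDirectory: bundle
  env:
    DATABRICKS_HOST: $(workspace_url)
    DATABRICKS_TOKEN: $(DATABRICKS_TOKEN)
    DATABRICKS_BUNDLE_ENV: $(target_branch)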

LuukDSL
New Contributor III

Thanks for your reply. We use a few different jobs, so that would mean all of these jobs would require this task, right? This seems like a rather manual approach for something you'd expect to happen automatically. Do you agree with that assessment?

Also, isn't the token that's created still a PAT? Or is this different because you use an Azure App ID for the SP? (I believe we only have one App ID, by the way.)

saurabh18cs
Honored Contributor

Hi,

Even if you have multiple jobs, why isn't your deployment procedure a single one? It should be one common deployment pipeline, right? This is more about your way of working, but it can be standardized.

I am sharing this from my own experience: when we pass client_id and secret, it seems that Asset Bundles only takes the SP's identity as the creator, but uses the deployer's identity as run_as.

If you use the approach I suggested, the OAuth token we generate carries the identity of the SP for both the creator and run_as. I would say give it a try. Thanks
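If you want to double-check after deploying, something like this with the Databricks CLI should show both identities (just a sketch, replace the job ID with your own):

databricks jobs get 123456789
# the response should include creator_user_name and run_as_user_name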

Thanks, that explains a lot! I will experiment with that approach and see if it works well for our use cases. Is there a risk that the code is manually changed by a user? Let's say the acc branch is pushed via the bundle file structure once a PR from tst has been merged. Now the whole repo (on the acc branch) is pushed. Can a user then change the repo in Databricks (either accidentally or deliberately), or have you found a way to keep it locked down?

Hello @LuukDSL ,

Yes, it's definitely possible that someone accidentally changes something in the bundle folder.

To prevent this, you have a couple of options:

  • You can restrict access to the folder entirely, or

  • You can grant VIEW access only, which means users can see the contents but won’t be able to edit the files without going through the standard process.

In either case, if users have access to the jobs, they'll still be able to run them regardless of their permissions on the actual files.
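If you prefer to manage this from the bundle itself, a rough sketch using the top-level permissions mapping in databricks.yml (the group name is just an example) could look like:

permissions:
  - level: CAN_VIEW
    group_name: users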

saurabh18cs
Honored Contributor

I have provided a solution to your problem; give it a try and share your feedback. Thanks

saurabh18cs
Honored Contributor

Hi @LuukDSL, have you tried the solution I provided above?