10-14-2024 05:55 AM
I have a CI/CD process that deploys to staging after a Pull Request (PR) into main.
It works in Azure Pipelines using a Personal Access Token.
From my local machine, deploying with a Service Principal works (https://community.databricks.com/t5/administration-architecture/use-a-service-principal-token-instea...).
But I want to deploy from Azure Pipelines using the Service Principal. How can I do that?
In case it helps, here is my current Azure Pipelines YAML:
jobs:
  - job: onMainPullRequestJob
    workspace:
      clean: all
    steps:
      - task: UsePythonVersion@0
        displayName: Set up Python 3.10
        inputs:
          versionSpec: '3.10'
      - script: curl -sSL https://install.python-poetry.org | python - --version 1.8.3
        displayName: Install Poetry
      - script: poetry config http-basic.$(ARTIFACT-FEED) $(USERNAME-FEED) $(System.AccessToken)
        displayName: Configure credentials for the feed
      - script: poetry install --with dev,test
        displayName: Install dependencies
      - script: poetry run pre-commit run --all-files
        displayName: Run pre-commit check
      - script: poetry run pytest tests/unit -s -vvv
        displayName: Run unit tests
      - script: curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
        displayName: Install Databricks CLI
      - script: |
          databricks bundle deploy --target staging
        env:
          DATABRICKS_HOST: $(DATABRICKS-HOST)
          DATABRICKS_TOKEN: $(DATABRICKS-TOKEN)
        displayName: Deploy the job
      - script: |
          databricks bundle run --target staging dab_job
        env:
          DATABRICKS_HOST: $(DATABRICKS-HOST)
          DATABRICKS_TOKEN: $(DATABRICKS-TOKEN)
        displayName: Launch workflow
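From what I have read about OAuth machine-to-machine authentication for service principals, I think the deploy step should end up looking something like this sketch (DATABRICKS-CLIENT-ID and DATABRICKS-CLIENT-SECRET would be new pipeline variables holding the Service Principal credentials; they do not exist in my pipeline yet), but I have not gotten it working:

      - script: |
          databricks bundle deploy --target staging
        env:
          DATABRICKS_HOST: $(DATABRICKS-HOST)
          # OAuth M2M credentials of the Service Principal (hypothetical pipeline variables)
          DATABRICKS_CLIENT_ID: $(DATABRICKS-CLIENT-ID)
          DATABRICKS_CLIENT_SECRET: $(DATABRICKS-CLIENT-SECRET)
        displayName: Deploy the job with the Service Principal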
Accepted Solutions
10-15-2024 02:02 PM - edited 10-15-2024 02:08 PM
I needed to deploy a job from Azure Pipelines CI/CD without relying on an interactive OAuth login; this is the way I did it:
First you need a Service Principal configured. Create it in your workspace, and you will end up with the following (a local profile sketch follows after this list):
- A host: your workspace URL, which follows this pattern: https://adb-XXXXXXXXXXXXXXXX.YY.azuredatabricks.net/
- A client_id: the application (client) ID of the Service Principal, shown when you generate a secret for it
- A client_secret: generated when you create a secret for the Service Principal
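For local testing, a minimal sketch of a ~/.databrickscfg profile for that Service Principal could look like this (the profile name and placeholder values are hypothetical); with client_id and client_secret set, the CLI uses OAuth machine-to-machine authentication for that profile:

[my-sp-staging]
host          = https://adb-XXXXXXXXXXXXXXXX.YY.azuredatabricks.net/
client_id     = AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE
client_secret = <your-service-principal-secret>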
Now there are some prerequisites:
- Have an Azure Pipeline for your project with its corresponding azure-pipelines.yaml for CI/CD
- Configure pipeline variables for the HOST, the CLIENT_ID and the CLIENT_SECRET (below they are referenced as YOUR-DATABRICKS-HOST, YOUR-SERVICE-PRINCIPAL-CLIENT-ID and YOUR-SERVICE-PRINCIPAL-SECRET); see the sketch right after this list
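As a minimal sketch (the variable group name is hypothetical), the same values can instead be kept in a variable group, optionally linked to Azure Key Vault, and referenced from the pipeline rather than defined as individual pipeline variables:

variables:
  - group: databricks-staging-secrets  # hypothetical group holding YOUR-DATABRICKS-HOST, YOUR-SERVICE-PRINCIPAL-CLIENT-ID and YOUR-SERVICE-PRINCIPAL-SECRET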
With that configured, here is the azure-pipelines.yml file I use:
pool: Azure Pipelines

trigger: none

pr:
  autoCancel: true
  branches:
    include:
      - main

stages:
  - stage: onDevPullRequest
    # Similar to onMainPullRequest
    ...
  - stage: onMainPullRequest
    # This stage is triggered when a PR is created into main
    # For instance, if you create a PR from feature/1.0.0 to main, this stage will be triggered
    # This stage will be skipped if the PR is created from release/* to main
    condition: |
      and(
        not(startsWith(variables['System.PullRequest.SourceBranch'], 'refs/heads/release')),
        startsWith(variables['System.PullRequest.TargetBranch'], 'refs/heads/main')
      )
    jobs:
      - job: onMainPullRequestJob
        workspace:
          clean: all
        steps:
          - task: UsePythonVersion@0
            displayName: Set up Python 3.10
            inputs:
              versionSpec: '3.10'
          - script: curl -sSL https://install.python-poetry.org | python - --version 1.8.3
            displayName: Install Poetry
          - script: poetry config http-basic.$(ARTIFACT-FEED) $(USERNAME-FEED) $(System.AccessToken)
            displayName: Configure credentials with the artifact feed
          - script: poetry install --with dev,test
            displayName: Install dependencies
          - script: poetry run pre-commit run --all-files
            displayName: Run pre-commit check
          - script: poetry run pytest tests/unit -s -vvv
            displayName: Run unit tests
          - bash: |
              # Install Databricks CLI
              curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
              # Verify installation
              databricks --version
              # Path to the Databricks config file ("~" does not expand inside quotes, so use $HOME)
              file="$HOME/.databrickscfg"
              # If it exists, remove it
              if [ -f "$file" ] ; then
                rm "$file"
              fi
              # Define the profile
              echo "[YOUR_WORKSPACE_PROFILE]" >> "$file"
              echo "host = $(YOUR-DATABRICKS-HOST)" >> "$file"
              echo "token = $(YOUR-SERVICE-PRINCIPAL-SECRET)" >> "$file"
              # Show the file
              cat "$file"
              # Request an OAuth token for the Service Principal and export it
              export DATABRICKS_TOKEN_SP=$(curl --request POST \
                --url https://adb-XXXXXXXXXXXXXXXX.YY.azuredatabricks.net/oidc/v1/token \
                --user "$(YOUR-SERVICE-PRINCIPAL-CLIENT-ID):$(YOUR-SERVICE-PRINCIPAL-SECRET)" \
                --data 'grant_type=client_credentials&scope=all-apis' | jq -r '.access_token'
              )
              # Make the token available to the next tasks
              echo "##vso[task.setvariable variable=DATABRICKS_TOKEN_SP]$DATABRICKS_TOKEN_SP"
            displayName: Install Databricks CLI and create config file
          - script: |
              databricks bundle deploy --target staging
            env:
              DATABRICKS_HOST: $(YOUR-DATABRICKS-HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_TOKEN_SP)
            displayName: Deploy the job
          - script: |
              databricks bundle run --target staging dab_your_workflow
            env:
              DATABRICKS_HOST: $(YOUR-DATABRICKS-HOST)
              DATABRICKS_TOKEN: $(DATABRICKS_TOKEN_SP)
            displayName: Launch workflow
  - stage: onRelease
    # Similar to onMainPullRequest
    ...
Steps:
- Install the Databricks CLI
- Generate a Databricks configuration file (~/.databrickscfg) for the profile you are deploying with
- Request a temporary OAuth token for the Service Principal
- Export the token as a pipeline variable so it can be used in the deploy task
- Deploy
This way we make sure the same environment variable "DATABRICKS_TOKEN_SP" is available to the following tasks. Also, no user has to log in interactively to run the CI/CD, which is a great step towards user-independent CI/CD processes.
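As an alternative sketch (not what I used above, and the variable names are just the placeholders from this post), the Databricks CLI also supports OAuth machine-to-machine authentication directly through environment variables, which would avoid the manual curl call and the config file entirely:

- script: |
    databricks bundle deploy --target staging
  env:
    DATABRICKS_HOST: $(YOUR-DATABRICKS-HOST)
    # The CLI exchanges these for an OAuth token on its own (OAuth M2M auth)
    DATABRICKS_CLIENT_ID: $(YOUR-SERVICE-PRINCIPAL-CLIENT-ID)
    DATABRICKS_CLIENT_SECRET: $(YOUR-SERVICE-PRINCIPAL-SECRET)
  displayName: Deploy the job (OAuth M2M sketch)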
If you need a template for the databricks.yml used here, this is it:
bundle:
  name: dab_your_workflow

# Declare to Databricks Asset Bundles that this is a Python project
# This is the interaction with the "pyproject.toml" file
artifacts:
  default:
    type: whl
    build: poetry build
    path: .

resources:
  jobs:
    dab_your_workflow:
      name: dab_your_workflow
      tasks:
        - task_key: your_workflow_task
          job_cluster_key: ${bundle.target}-${bundle.name}-job-cluster
          python_wheel_task:
            package_name: dab_your_workflow
            entry_point: your_workflow_entry_point
            parameters:
              - --conf-file
              - "/Workspace${workspace.root_path}/files/conf/tasks/your_workflow_task_config.yml"
              - --env
              - ${bundle.target}
          libraries:
            - whl: ./dist/*.whl

targets:
  dev:
    # Similar to staging
    ...
  prod:
    # Similar to staging
    ...
  staging:
    mode: production
    workspace:
      host: https://adb-XXXXXXXXXXXXXXXX.YY.azuredatabricks.net/
    run_as:
      service_principal_name: AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE
    resources:
      jobs:
        dab_your_workflow:
          job_clusters:
            - job_cluster_key: ${bundle.target}-${bundle.name}-job-cluster
              new_cluster:
                num_workers: 2
                spark_version: "14.3.x-cpu-ml-scala2.12"  # Specify the Spark version
                spark_conf:
                  # You can specify the Spark configuration here
                node_type_id: Standard_F8  # Specify the node type
                spark_env_vars:
                  # You can specify Spark environment variables, for example:
                  PIP_EXTRA_INDEX_URL: "{{secrets/kv-your-key-vault/your-url-for-pip-extra-index-url}}"
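As a quick sanity check before wiring everything into the pipeline, the bundle can be validated and deployed from your machine against the same target (the profile name is the one created in the config-file step above):

# Validate the bundle definition against the staging target
databricks bundle validate --target staging --profile YOUR_WORKSPACE_PROFILE
# Then deploy it locally to confirm the Service Principal has the right permissions
databricks bundle deploy --target staging --profile YOUR_WORKSPACE_PROFILE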
I believe there are ways to "improve this bonsai answer", so if you see anything to improve, please comment.