Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Generate a Workflow that Waits for Library Installation

PabloCSD
Contributor II

I have a process in DBX/DAB, and I use a Service Principal to generate a token for reaching our artifacts feed; for security, the token expires after 1 hour.

import requests

YOUR_AZURE_TENANT_ID = ...
YOUR_SERVICE_PRINCIPAL_CLIENT_ID = ...
YOUR_SECRET_SCOPE = ...
YOUR_SECRET_KEY = ...
SCOPE = ...  # Scope for the Azure DevOps Services API

url = f"https://login.microsoftonline.com/{YOUR_AZURE_TENANT_ID}/oauth2/v2.0/token"

# Client-credentials grant; the client secret is read from a Databricks secret scope
payload = {
    "grant_type": "client_credentials",
    "client_id": YOUR_SERVICE_PRINCIPAL_CLIENT_ID,
    "client_secret": dbutils.secrets.get(scope=YOUR_SECRET_SCOPE, key=YOUR_SECRET_KEY),
    "scope": SCOPE,
}

files = [
    ...
]

headers = {
    ...
}

response = requests.post(url, headers=headers, data=payload, files=files)

# Get the "access_token" from the JSON response
access_token = response.json().get("access_token")

print(access_token)
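
Since the first task is supposed to leave the token in a file for the main workflow (see step 1 below), the notebook would also need a save step at the end. A minimal, hypothetical sketch; the DBFS /tmp path is an assumption, and TOKEN_FILENAME is the base parameter passed in the YAML below:

# Hypothetical save step: persist the token where the main task can read it.
# The DBFS /tmp location is an assumption; use whatever path your .yaml defines.
token_filename = dbutils.widgets.get("TOKEN_FILENAME")
dbutils.fs.put(f"/tmp/{token_filename}", access_token, True)  # overwrite=True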

I want a workflow that first runs the token generation (the code above) and then runs the original workflow:

  1. A generate-token task creates the token and writes it to a location defined in the .yaml.
  2. The main workflow task, in its init script, reads the contents of that file and sets the PIP_EXTRA_INDEX_URL environment variable (a sketch of such a script follows this list).
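
For illustration, a minimal sketch of such an init script, assuming the token was written to DBFS under /tmp (as in the hypothetical save step above); the ORG/PROJECT/FEED feed coordinates are placeholders:

#!/bin/bash
# Sketch of init_pip_extra_index_url.sh; paths and feed coordinates are assumptions.
set -euo pipefail

TOKEN_PATH="/dbfs/tmp/${TOKEN_FILENAME}"   # TOKEN_FILENAME comes from spark_env_vars
TOKEN=$(cat "${TOKEN_PATH}")

# A plain export only affects this script's shell, so also persist the variable
# to /etc/environment so later processes (such as pip) can see it.
PIP_URL="https://build:${TOKEN}@pkgs.dev.azure.com/ORG/PROJECT/_packaging/FEED/pypi/simple/"
export PIP_EXTRA_INDEX_URL="${PIP_URL}"
echo "PIP_EXTRA_INDEX_URL=${PIP_URL}" >> /etc/environment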

The problem is that when I combine both into a single workflow, the main task starts resolving its libraries right away instead of waiting for the first task to set the token.

When I run the two steps as separate workflows it works (first the token is generated, then the original workflow runs), but it is cost-inefficient, because I need to renew the token every hour.

 

  workflows:
    - name: dev-workflow-process

      job_clusters:
        - job_cluster_key: dev
          new_cluster:
            init_scripts:
              - workspace:
                  destination: "/Workspace/Shared/SP-LIB-INSTALLATION/init_pip_extra_index_url.sh"
            spark_version: "11.3.x-cpu-ml-scala2.12"
            driver_node_type_id: "Standard_F8"
            node_type_id: "Standard_F8"
            num_workers: 2
            spark_env_vars:
              TOKEN_FILENAME: "pip_token"

        - job_cluster_key: "generate-token"
          new_cluster:
            spark_version: "11.3.x-cpu-ml-scala2.12"
            driver_node_type_id: "Standard_F8"
            node_type_id: "Standard_F8"
            num_workers: 1
      tasks:
        - task_key: "generate-token"
          job_cluster_key: "generate-token"
          notebook_task:
            notebook_path: "/Workspace/Shared/SP-LIB-INSTALLATION/GenerateAndSaveToken"  # Updated path
            base_parameters:
              TOKEN_FILENAME: "pip_token"  # Specify the desired token file name here

        - task_key: "main-task"
          depends_on:
          - task_key: "generate-token"
          job_cluster_key: !? $.env
          python_wheel_task:
              package_name: "dev-workflow-process"
              entry_point: "entrypoint"
              parameters:
              - "--conf-file"
              - "file:fuse://conf/tasks/main_task_config.yml"
build:
  python: "poetry"

Is there a way to force the second workflow to wait for the token to be generated?

3 REPLIES

agallardrivilla
New Contributor II

Hi @PabloCSD,

If the workflows are configured within a single Databricks job, you can use depends_on to ensure the second workflow waits for the completion of the generate-token task. This works well for cases where both workflows are in the same job context.

Another option, if the workflows live in separate Databricks Jobs and cannot share a context, is to schedule the second workflow to start a few minutes after the first, estimating the time needed to generate the token:

  • First Workflow: Scheduled to run every hour or at specific intervals.
  • Second Workflow: Scheduled to start 5–10 minutes after the first, ensuring the token is ready (a sketch of this offset follows).
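
As a hedged sketch of that offset scheduling (the cron expressions are illustrative; the schedule block follows the Jobs API, which uses Quartz cron syntax with a seconds field):

# Illustrative only: token job fires on the hour, main job 10 minutes later
workflows:
  - name: generate-token-job
    schedule:
      quartz_cron_expression: "0 0 * * * ?"    # every hour, minute 0
      timezone_id: "UTC"
  - name: dev-workflow-process
    schedule:
      quartz_cron_expression: "0 10 * * * ?"   # every hour, minute 10
      timezone_id: "UTC"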

These strategies allow the second workflow to explicitly wait for the token generation to complete, reducing risks of errors and improving efficiency in loading libraries and other components that depend on the token.

Try it and comment! Regards.

Alfonso Gallardo
-------------------
 I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark

PabloCSD
Contributor II

Hello @agallardrivilla,

I tried the first solution, but the main task still tried to install its dependencies before the token was ready, so it didn't work (as you can see in the posted .yaml, I already have a depends_on).

The workflow runs on demand, whenever the user requires it; that is why I want to connect the token generation and the main task in one workflow.

The second solution could be too expensive: scheduling it every hour means paying for a cluster that exists only to generate a token.

I don't know whether there are cheaper, serverless cluster options available for this process.

Thanks for your insights

agallardrivilla
New Contributor II

Hi @PabloCSD,

Here are some refined solutions that keep costs low and ensure the main workflow waits until the token is generated:

Instead of separating the token generation and main tasks, consider generating the token directly within the initialization script of the main workflow. This way, the token is created each time the workflow is triggered, and the main task can use it immediately.

  • In the init_pip_extra_index_url.sh script:
    1. Add the code to request the token within the initialization script.
    2. Store the token in an environment variable, such as PIP_EXTRA_INDEX_URL.

For example, you could use this snippet in init_pip_extra_index_url.sh:

 

# Generate a token and expose it as an environment variable.
# Note: how you fetch the client secret here depends on your CLI version
# (newer CLIs use `databricks secrets get-secret`), and the CLI must be
# installed and authenticated inside the init script.
TOKEN=$(curl -s -X POST https://login.microsoftonline.com/$YOUR_AZURE_TENANT_ID/oauth2/v2.0/token \
        -d "grant_type=client_credentials" \
        -d "client_id=$YOUR_SERVICE_PRINCIPAL_CLIENT_ID" \
        -d "client_secret=$(databricks secrets get --scope $YOUR_SECRET_SCOPE --key $YOUR_SECRET_KEY)" \
        -d "scope=$SCOPE" | jq -r '.access_token')

# Export the token for pip installation; ORG/PROJECT/FEED are placeholders
# for your Azure DevOps feed coordinates.
export PIP_EXTRA_INDEX_URL="https://build:$TOKEN@pkgs.dev.azure.com/ORG/PROJECT/_packaging/FEED/pypi/simple/"

 

This approach means the token is generated only when the main workflow runs, eliminating the need for a separate, scheduled token generation task.

If you still prefer having separate tasks but are concerned about costs, Databricks serverless clusters are indeed a good fit for the token generation task. Serverless clusters provide a lower-cost option by only billing for the active compute time, making them cost-effective for tasks with short runtimes, such as token generation.

  • To use serverless compute for this task:
    • If your workspace supports serverless jobs compute, it spins up quickly and incurs costs only during active processing.
    • Configure the generate-token task to use this serverless setup (a sketch follows).
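
One hedged sketch (Databricks Asset Bundles behavior; availability depends on your workspace and plan): where serverless jobs compute is enabled, a notebook task declared without any job_cluster_key or new_cluster runs on serverless compute, so the token task reduces to:

# Hedged sketch: no cluster specification means the task runs on serverless
# jobs compute, in workspaces where that feature is enabled.
tasks:
  - task_key: "generate-token"
    notebook_task:
      notebook_path: "/Workspace/Shared/SP-LIB-INSTALLATION/GenerateAndSaveToken"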

Since you’re already using depends_on but dependency installation still starts prematurely, it may help to configure how dependencies are handled at the task level directly in the YAML and the init script.

To ensure dependencies aren’t installed until the token is ready:

  • Make sure the init_pip_extra_index_url.sh script (with the token generation logic) runs before dependencies are required in main-task; cluster-scoped init scripts do run before libraries are installed.
  • If you use a package manager like poetry or custom Python packages, add a sleep to the init script as a brief pause, so the token is fully generated before dependencies start downloading (or better, poll for the token file, as in the sketch below).
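
Rather than a fixed sleep, a small polling loop is more robust; a sketch, reusing the assumed token path from earlier:

# Poll for the token file instead of sleeping for a fixed interval.
# TOKEN_PATH is an assumed location; fail after ~5 minutes instead of hanging.
TOKEN_PATH="/dbfs/tmp/${TOKEN_FILENAME}"
for _ in $(seq 1 60); do
    [ -s "${TOKEN_PATH}" ] && break
    sleep 5
done
[ -s "${TOKEN_PATH}" ] || { echo "Token file never appeared at ${TOKEN_PATH}" >&2; exit 1; }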

These strategies should help create a unified and cost-efficient workflow without prematurely initiating dependency installations. Let me know how it works out or if any further refinements are needed!

Best regards.

Alfonso Gallardo
-------------------
 I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark
