10-29-2024 09:16 AM
I have a process in DBX/DAB and I use a service principal to generate a token for accessing the artifacts feed. For security reasons, this token lasts 1 hour.
import requests
YOUR_AZURE_TENANT_ID = ...
YOUR_SERVICE_PRINCIPAL_CLIENT_ID = ...
YOUR_SECRET_SCOPE = ...
YOUR_SECRET_KEY = ...
SCOPE = ... # Scope for Azure DevOps Services API
url = f'https://login.microsoftonline.com/{YOUR_AZURE_TENANT_ID}/oauth2/v2.0/token'
payload = {'grant_type': 'client_credentials',
           'client_id': YOUR_SERVICE_PRINCIPAL_CLIENT_ID,
           'client_secret': dbutils.secrets.get(scope=YOUR_SECRET_SCOPE, key=YOUR_SECRET_KEY),
           'scope': SCOPE}
files = [
    ...
]
headers = {
    ...
}
response = requests.request("POST", url, headers=headers, data=payload, files=files)
# get the "access_token" from the response
access_token = response.json().get("access_token")
print(access_token)
I want a single workflow that runs the token generation (the code above) and then runs the original workflow.
I built a workflow that does that, but the original workflow starts looking for its libraries instead of waiting for the first task to set the token.
When I run the two steps independently it works (first generate the token, then run the original workflow), but that is cost-inefficient because I need to renew the token every hour.
workflows:
  - name: dev-workflow-process
    job_clusters:
      - job_cluster_key: dev
        new_cluster:
          init_scripts:
            - workspace:
                destination: "/Workspace/Shared/SP-LIB-INSTALLATION/init_pip_extra_index_url.sh"
          spark_version: "11.3.x-cpu-ml-scala2.12"
          driver_node_type_id: "Standard_F8"
          node_type_id: "Standard_F8"
          num_workers: 2
          spark_env_vars:
            TOKEN_FILENAME: "pip_token"
      - job_cluster_key: "generate-token"
        new_cluster:
          spark_version: "11.3.x-cpu-ml-scala2.12"
          driver_node_type_id: "Standard_F8"
          node_type_id: "Standard_F8"
          num_workers: 1
    tasks:
      - task_key: "generate-token"
        job_cluster_key: "generate-token"
        notebook_task:
          notebook_path: "/Workspace/Shared/SP-LIB-INSTALLATION/GenerateAndSaveToken" # Updated path
          base_parameters:
            TOKEN_FILENAME: "pip_token" # Specify the desired token file name here
      - task_key: "main-task"
        depends_on:
          - task_key: "generate-token"
        job_cluster_key: !? $.env
        python_wheel_task:
          package_name: "dev-workflow-process"
          entry_point: "entrypoint"
          parameters:
            - "--conf-file"
            - "file:fuse://conf/tasks/main_task_config.yml"
build:
  python: "poetry"
Is there a way to force the second workflow to wait for the token to be generated?
10-29-2024 09:31 AM
Hi @PabloCSD,
If both workflows are configured as tasks within a single Databricks job, you can use depends_on so that the second task waits for the generate-token task to complete. This works well when both run in the same job context.
Alternatively, if the workflows are separate Databricks jobs and cannot be combined into one, you could schedule the second workflow to start a few minutes after the first, estimating the time needed to generate the token.
Either strategy is meant to make the second workflow wait for token generation, which reduces the risk of errors when loading libraries and other components that depend on the token.
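To make the depends_on ordering concrete, here is a minimal Python sketch of how dependencies decide which tasks may start. It is only an illustration, not Databricks' scheduler: `runnable_tasks` is a hypothetical helper, and the task dictionaries simply mirror the task_key and depends_on fields from the YAML posted above.

```python
def runnable_tasks(tasks, completed):
    """Return the keys of tasks whose dependencies have all completed."""
    ready = []
    for task in tasks:
        key = task["task_key"]
        if key in completed:
            continue  # already finished, nothing to schedule
        deps = [dep["task_key"] for dep in task.get("depends_on", [])]
        if all(dep in completed for dep in deps):
            ready.append(key)
    return ready


# Same two tasks as in the posted workflow definition.
tasks = [
    {"task_key": "generate-token"},
    {"task_key": "main-task", "depends_on": [{"task_key": "generate-token"}]},
]

print(runnable_tasks(tasks, completed=set()))
print(runnable_tasks(tasks, completed={"generate-token"}))
```

The sketch only models task ordering. Whether cluster init scripts and library installation for the dependent task also wait is a separate question that it does not cover.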
Try it and let me know how it goes! Regards.
10-29-2024 10:13 AM
Hello @agallard,
I tried the first solution, but it didn't work: the workflow still tries to install the dependencies used by the second task (the .yaml I posted already has a depends_on).
The workflow runs whenever a user requires it, which is why I'd like to connect the token generation and the main task in one workflow.
The second solution could be too expensive: scheduling it every hour would mean keeping a cluster just to generate a token on that schedule.
I don't know whether cheaper serverless clusters are available for this process.
Thanks for your insights
10-30-2024 06:20 AM
Hi @PabloCSD,
Here are some refined solutions that keep costs low and ensure the main workflow waits until the token is generated:
Instead of separating the token generation and main tasks, consider generating the token directly within the initialization script of the main workflow. This way, the token is created each time the workflow is triggered, and the main task can use it immediately.
For example you can use this snippet for init_pip_extra_index_url.sh:
# Generate token and set as environment variable
TOKEN=$(curl -X POST "https://login.microsoftonline.com/$YOUR_AZURE_TENANT_ID/oauth2/v2.0/token" \
  -d "grant_type=client_credentials" \
  -d "client_id=$YOUR_SERVICE_PRINCIPAL_CLIENT_ID" \
  -d "client_secret=$(databricks secrets get --scope $YOUR_SECRET_SCOPE --key $YOUR_SECRET_KEY)" \
  -d "scope=$SCOPE" | jq -r '.access_token')
# Export the token for PIP installation
export PIP_EXTRA_INDEX_URL="https://$TOKEN"
This approach means the token is generated only when the main workflow runs, eliminating the need for a separate, scheduled token generation task.
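If you would rather keep this logic in Python, a rough equivalent might look like the sketch below. It uses only the standard library and adds a freshness check, so a new token is requested only when the current one is close to its one-hour expiry. The helper names (`build_token_request`, `fetch_token`, `needs_refresh`) and the 300-second safety margin are assumptions for illustration, not part of any Databricks or Azure SDK.

```python
import json
import time
import urllib.parse
import urllib.request


def build_token_request(tenant_id, client_id, client_secret, scope):
    """Build the URL and form-encoded body for the client-credentials request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode("utf-8")
    return url, body


def fetch_token(tenant_id, client_id, client_secret, scope):
    """POST the request and return (access_token, expiry as epoch seconds)."""
    url, body = build_token_request(tenant_id, client_id, client_secret, scope)
    request = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # expires_in is the token lifetime in seconds, as reported by the endpoint.
    return payload["access_token"], time.time() + int(payload["expires_in"])


def needs_refresh(expires_at, now=None, margin_seconds=300):
    """Return True when the token expires within the safety margin (assumed 300 s)."""
    now = time.time() if now is None else now
    return now >= expires_at - margin_seconds
```

A caller would store the expiry alongside the token and call `fetch_token` again only when `needs_refresh(expires_at)` returns True, which avoids renewing on a fixed hourly schedule.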
If you still prefer having separate tasks but are concerned about costs, Databricks serverless clusters are indeed a good fit for the token generation task. Serverless clusters provide a lower-cost option by only billing for the active compute time, making them cost-effective for tasks with short runtimes, such as token generation.
Since you’re already using depends_on, but experiencing issues with dependency installation starting prematurely, there may be a solution in configuring task-level dependency handling directly in the YAML.
To ensure dependencies aren’t installed until the token is ready:
These strategies should help create a unified and cost-efficient workflow without prematurely initiating dependency installations. Let me know how it works out or if any further refinements are needed!
Best regards.