Thursday
Hello, community!
I have a question about deploying workflows in a production environment. Specifically, how can we deploy a group of workflows to production so that they are created only once and cannot be duplicated by others?
Currently, if someone deploys a GitHub repository containing DABs definitions, it creates new workflows that are accessible only to the person who deployed them. However, in a production scenario, workflows should be deployed just once, and no one should be able to create duplicates.
Is there a specific command or configuration in DABs to prevent this issue?
Additionally, is it possible to assign a group of people permissions to start and stop the workflows created in production?
Thanks, as always!
Thursday
Hello Jeremy, many thanks for reaching out. Is the intention that new users just trigger the existing workflow instead of creating a new one via DABs?
Thursday
Hi @jeremy98,
You can explore the run_as configuration in your DABs, together with the production deployment mode. run_as sets the identity that the workflows run as, and the deployment modes page describes how a production deployment is expected to behave:
https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html
Friday - last edited Friday
So, I don't understand whether it is possible to overwrite the same workflow. It could become a mess if someone changes a cluster configuration, and I want to be sure that there is only one active workflow, with the new configuration. I'm saying this because one workflow was deployed in production, and it was then replicated with a new cluster configuration when the existing one should have been overwritten. Why does it create a new workflow?
prod:
  workspace:
    host: <host_url>
    root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
  mode: production
  # permissions:
  #   - user_name: ${workspace.current_user.userName}
  #     level: CAN_MANAGE
  run_as:
    service_principal_name: <sp_id>
  sync:
    exclude:
      - ./notebook/stg/*.*

resources:
  jobs:
    sync_delta_and_db:
      name: sync_delta_and_db_${bundle.target}
      schedule: # runs the job every day at 3AM
        quartz_cron_expression: "0 0 3 * * ?"
        timezone_id: "UTC"
      tasks:
        - task_key: sync_delta_${bundle.target}
          job_cluster_key: sync_delta_${bundle.target}_cluster
          notebook_task:
            notebook_path: ./notebook/${bundle.target}/db_sync_initial_wip.ipynb
            source: WORKSPACE
          libraries:
            - whl: ${workspace.root_path}/files/dist/<lib>-0.0.1-py3-none-any.whl
      job_clusters: # TODO: this needs to be resized once we understand how to handle massive data properly
        - job_cluster_key: sync_delta_${bundle.target}_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_DS3_v2
            runtime_engine: PHOTON
            num_workers: 0
            spark_conf:
              spark.databricks.cluster.profile: singleNode
              spark.master: local[*]
            custom_tags:
              ResourceClass: SingleNode
Thursday
About your second question: you can use the UI to grant a group the CAN_MANAGE permission on the workflow job.
https://docs.databricks.com/en/jobs/privileges.html
https://kb.databricks.com/en_US/security/bulk-update-workflow-permissions-for-a-group
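As an alternative to the UI, the grant can also be declared in the bundle so that it is reapplied on every deploy. The following is a minimal sketch, not a definitive setup: it reuses the job key from the configuration posted earlier in this thread and assumes a workspace group called `prod-operators`, which is a hypothetical name. `CAN_MANAGE_RUN` is the job permission level that allows starting and cancelling runs without being able to edit the job.

```yaml
resources:
  jobs:
    sync_delta_and_db:
      name: sync_delta_and_db_${bundle.target}
      permissions:
        # "prod-operators" is a hypothetical group name; replace it with a real workspace group.
        - group_name: prod-operators
          level: CAN_MANAGE_RUN
```

If the group should also be able to change the job definition, `CAN_MANAGE` can be used as the level instead.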
Friday
Thanks guys, I'm going to try those suggestions!
Friday
Did the name of the workflow remain the same, or was the job name changed? If it is exactly the same name, does the UI show the duplicate name?
Friday
Hi Walter, the name of the workflow is the same. The only thing I changed is the compute configuration, which I switched to PHOTON without actually using it. The creator of the workflow is also different: the first one was created by my colleague, while I created the new one, thinking it would overwrite the existing one, but that is not what happened. How can I solve this problem? Is it possible to have only one workflow :(?
Saturday
Last night I ran into another issue:
run failed with error message Unable to access the notebook "/Workspace/Users/<user email>/.bundle/rnc_data_pipelines/prod/files/notebook/prod/db_sync_initial_wip". Either it does not exist, or the identity used to run this job, sp-prod-databricks (<id of sp>), lacks the required permissions.
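One plausible reading of both this error and the duplicated workflow, offered as an assumption rather than a confirmed diagnosis: with root_path under /Workspace/Users/${workspace.current_user.userName}, each person who deploys gets a separate copy of the bundle files and deployment state in their own user folder. A second deployer therefore creates a second job instead of updating the first, and the run_as service principal may not be able to read notebooks stored in a personal folder. A sketch of a prod target that uses one shared root path instead (the /Workspace/Shared location is an assumption; any folder that all deployers and the service principal can access would serve the same purpose):

```yaml
targets:
  prod:
    mode: production
    workspace:
      host: <host_url>
      # Assumption: a shared folder rather than the deployer's user folder,
      # so every deploy reads and updates the same files and state.
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
    run_as:
      service_principal_name: <sp_id>
```

With a single root path, a later deploy of the same bundle and target should update the existing job rather than create a new one, provided the deploying identity can write to that folder.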
yesterday
Hi guys, any news?
yesterday
I got some information from my internal team:
The main thing that helps here is deploying as a service principal and setting mode: production on the target. This is best done by setting up automation, such as GitHub Actions or an Azure DevOps pipeline. You may choose a different service principal as the run-as user, but you would then need to set permissions for whoever will run the workflows. You can set permissions at a few levels in DABs, so if you decide that a service principal will be the owner on every deploy, you just set the appropriate run permissions for the various groups or service principals that need access.
https://docs.databricks.com/en/dev-tools/bundles/permissions.html
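The automation suggested above could look roughly like the following GitHub Actions workflow. This is a sketch under stated assumptions, not a reference pipeline: the file name, workflow name, branch, and secret names are hypothetical, and it assumes the service principal authenticates with OAuth machine-to-machine credentials stored as repository secrets. The workspace host is taken from the prod target in the bundle configuration.

```yaml
# .github/workflows/deploy-prod.yml (hypothetical file name)
name: deploy-bundle-prod

on:
  push:
    branches: [main]   # assumption: production is deployed from main

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Installs the Databricks CLI on the runner.
      - uses: databricks/setup-cli@main
      - name: Deploy bundle to the prod target
        run: databricks bundle deploy -t prod
        env:
          # Hypothetical secret names holding the service principal's OAuth credentials.
          DATABRICKS_CLIENT_ID: ${{ secrets.DATABRICKS_CLIENT_ID }}
          DATABRICKS_CLIENT_SECRET: ${{ secrets.DATABRICKS_CLIENT_SECRET }}
```

Because every production deploy then runs under the same service principal, individual users no longer need to run the deploy themselves, which is what produced the per-user copies described earlier in the thread.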