Databricks Community

jeremy98 · 2 weeks ago

Hello, community!

I have a question about deploying workflows in a production environment. Specifically, how can we deploy a group of workflows to production so that they are created only once and cannot be duplicated by others?

Currently, if someone deploys a GitHub repository containing DABs definitions, it creates new workflows that are accessible only to the person who deployed them. However, in a production scenario, workflows should be deployed just once, and no one should be able to create duplicates.

Is there a specific command or configuration in DABs to prevent this issue?

Additionally, is it possible to assign a group of people permissions to start and stop the workflows created in production?

Thanks, as always!

Walter_C · 2 weeks ago

Hello Jeremy, many thanks for reaching out. The intention is that new users just trigger the existing workflow instead of creating a new one via DABs, correct?

Alberto_Umana · 2 weeks ago

Hi @jeremy98,

You can explore the run_as configuration in your DABs. It sets the identity that the workflows run as, regardless of who deploys the bundle, and together with a production target it should help ensure that the workflows are created only once and are not duplicated by others.

https://docs.databricks.com/en/dev-tools/bundles/deployment-modes.html

jeremy98 · a week ago

So, I don't understand whether it is possible to overwrite the same workflow. It could become a mess if someone changes a cluster configuration, and I want to be sure that only one workflow is active, with the new configuration. I'm asking because one workflow was deployed in production, and it was then replicated with a new cluster configuration, when the existing one should have been overwritten. Why does it create a new workflow?

  prod:
    workspace:
      host: <host_url>
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    
    mode: production

    # permissions:
    #  - user_name: ${workspace.current_user.userName}
    #    level: CAN_MANAGE

    run_as:
      service_principal_name: <sp_id>

    sync:
      exclude: 
        - ./notebook/stg/*.*

    resources:
      jobs:
        sync_delta_and_db:
          name: sync_delta_and_db_${bundle.target}

          schedule: # runs the job every day at 3AM
            quartz_cron_expression: "0 0 3 * * ?"
            timezone_id: "UTC"

          tasks:
            - task_key: sync_delta_${bundle.target}
              job_cluster_key: sync_delta_${bundle.target}_cluster
              notebook_task:
                notebook_path: ./notebook/${bundle.target}/db_sync_initial_wip.ipynb
                source: WORKSPACE
              libraries:
                - whl: ${workspace.root_path}/files/dist/<lib>-0.0.1-py3-none-any.whl

          job_clusters: # TODO: this needs to be resized once we understand how to handle massive data properly
            - job_cluster_key: sync_delta_${bundle.target}_cluster
              new_cluster:
                spark_version: 15.4.x-scala2.12
                node_type_id: Standard_DS3_v2
                runtime_engine: PHOTON
                num_workers: 0
                spark_conf:
                  spark.databricks.cluster.profile: singleNode
                  spark.master: local[*]
                custom_tags:
                  ResourceClass: SingleNode

Alberto_Umana · 2 weeks ago

About your second question: you can use the UI to grant a group the Can Manage permission on the workflow job.

https://docs.databricks.com/en/jobs/privileges.html

https://kb.databricks.com/en_US/security/bulk-update-workflow-permissions-for-a-group
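
If you would rather keep this in the bundle than set it in the UI, a job-level permissions block is one option. This is a minimal sketch, assuming a workspace group already exists; the <group_name> placeholder is hypothetical and does not come from this thread. CAN_MANAGE_RUN is the job permission level that allows triggering and cancelling runs without allowing edits to the job itself.

```yaml
resources:
  jobs:
    sync_delta_and_db:
      name: sync_delta_and_db_${bundle.target}
      # Job-level permissions; <group_name> is a placeholder for your own group.
      permissions:
        - group_name: <group_name>
          level: CAN_MANAGE_RUN   # run and cancel runs, but not edit the job
```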

jeremy98 · a week ago

Thanks, guys. I'm going to try those suggestions!

Walter_C · a week ago

Did the name of the workflow remain the same, or was the job name changed? If the name is exactly the same, does the UI show two jobs with that name?

jeremy98 · a week ago

Hi Walter, the name of the workflow is the same. The only thing I changed is the compute configuration, which I switched to the PHOTON configuration that it was not using before. The creator of the workflow is also different: the first one was created by my colleague, while I created the new one, thinking it would overwrite the existing one, but it did not. How can I solve this problem? Is it possible to have only one workflow? :(

jeremy98 · a week ago

I ran into another issue last night:

run failed with error message Unable to access the notebook "/Workspace/Users/<user email>/.bundle/rnc_data_pipelines/prod/files/notebook/prod/db_sync_initial_wip". Either it does not exist, or the identity used to run this job, sp-prod-databricks (<id of sp>), lacks the required permissions.
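
The path in this error sits under a personal /Workspace/Users/... folder, which matches the root_path of the prod target posted earlier. One possible fix is sketched below, under the assumption that a shared workspace location readable by the service principal is acceptable. The /Workspace/Shared path is an assumption and is not confirmed anywhere in this thread.

```yaml
targets:
  prod:
    mode: production
    workspace:
      host: <host_url>
      # Assumption: a fixed path that does not depend on who runs the deploy,
      # so every deployment uses the same files and the same deployment state.
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}
    run_as:
      service_principal_name: <sp_id>
```

Because the path no longer contains ${workspace.current_user.userName}, a second person deploying the same target should update the existing job rather than create a separate copy. The service principal still needs read access to that folder.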

jeremy98 · a week ago

Hi guys, any news?

Walter_C · Monday

I got some information from my internal team:

The main thing that helps here is deploying as a service principal and setting mode: production on the target. This is best done by setting up automation, such as GitHub Actions or an Azure DevOps pipeline. You may choose a different service principal as the run_as user, but you would then need to set permissions for whoever will run the jobs. You can set permissions at a few levels in DABs, so if you decide that a service principal will be the owner every time you deploy, you just set the appropriate run permissions for the various groups or SPs that need access.
https://docs.databricks.com/en/dev-tools/bundles/permissions.html
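
The advice above could look roughly like this in databricks.yml. It is a sketch only: the <group_name> placeholder is hypothetical, and it assumes the deploy itself is always executed by the same service principal from CI, so that the default per-user root_path resolves to the same location on every deploy.

```yaml
targets:
  prod:
    mode: production
    workspace:
      host: <host_url>
      # root_path is left at its default here; it stays stable only if the
      # same identity (the CI service principal) performs every deploy.
    run_as:
      service_principal_name: <sp_id>
    # Top-level permissions apply to the resources defined in the bundle.
    # Allowed levels at this level are CAN_VIEW, CAN_RUN and CAN_MANAGE.
    permissions:
      - group_name: <group_name>   # placeholder for your own group
        level: CAN_RUN
```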

How to deploy unique workflows that run in production
