Databricks Community

dataengutility · ‎06-07-2024

Hi all,

I have been having some trouble running a workflow that consists of 3 tasks that run sequentially. Task1 runs on an all-purpose cluster and kicks off Task2 that needs to run on a job cluster. Task2 kicks off Task3 which also uses a job cluster.

We have identified that Task2 is running on an all-purpose cluster instead of a job cluster, even though the task is configured to use a job cluster in the asset bundle's yaml file. Task2 depends on Task1, which does use the all-purpose cluster as specified in the yaml file. We tried modifying the yaml file, but when running databricks bundle validate, the task appears to be overridden to use the all-purpose cluster despite explicitly pointing at the job cluster. Other changes, such as renaming the tasks, are picked up by the validate command.

Here is a snippet of the yaml file:

tasks:
  - task_key: Task1
    existing_cluster_id: all-purpose-cluster-id
    notebook_task:
      notebook_path: ../src/Task1.py
      base_parameters:
        catalog: ${var.catalog}
        target: ${var.target}

  - task_key: Task2
    job_cluster_key: job-cluster
    depends_on:
      - task_key: Task1
    notebook_task:
      notebook_path: ../src/Task2.py
      base_parameters:
        catalog: ${var.catalog}
        target: ${var.target}

  - task_key: Task3
    job_cluster_key: job-cluster
    depends_on:
      - task_key: Task2
    notebook_task:
      notebook_path: ../src/Task3.py
      base_parameters:
        catalog: ${var.catalog}
        target: ${var.target}

After running databricks bundle validate, this is the output:

"tasks": [
          {
            "existing_cluster_id": "all-purpose-cluster-id",
            "notebook_task": {
              "base_parameters": {
                "catalog": "catalog",
                "target": "target"
              },
              "notebook_path": "/Users/user/.bundle/folder/dev/files/src/Task1"
            },
            "task_key": "Task1"
          },
          {
            "depends_on": [
              {
                "task_key": "Task1"
              }
            ],
            "existing_cluster_id": "all-purpose-cluster-id",
            "notebook_task": {
              "base_parameters": {
                "catalog": "catalog",
                "target": "target"
              },
              "notebook_path": "/Users/user/.bundle/folder/dev/files/src/Task2"
            },
            "task_key": "Task2"
          },
          {
            "depends_on": [
              {
                "task_key": "Task2"
              }
            ],
            "existing_cluster_id": "all-purpose-cluster-id",
            "notebook_task": {
              "base_parameters": {
                "catalog": "catalog",
                "target": "target"
              },
              "notebook_path": "/Users/user/.bundle/folder/dev/files/src/Task3"
            },
            "task_key": "Task3"
          }
        ]

As you can see, the all-purpose cluster id is replacing the job-cluster key for Task2 and Task3. The strangest part of all of this is that I'm the only one on the team that is experiencing this issue. Everyone else seems to be able to run the workflow without any issues. Any ideas on how to resolve this issue?
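In case it helps anyone compare their own output, here is a minimal Python sketch of the check I am doing by eye. It is not part of the bundle: the function names (cluster_field, mismatched_tasks) and the EXPECTED mapping are made up for illustration, and the task list is a trimmed copy of the validate output above with only the fields that matter.

```python
import json

# Trimmed copy of the "tasks" array from the validate output in this post.
RESOLVED_TASKS = json.loads("""
[
  {"task_key": "Task1", "existing_cluster_id": "all-purpose-cluster-id"},
  {"task_key": "Task2", "existing_cluster_id": "all-purpose-cluster-id"},
  {"task_key": "Task3", "existing_cluster_id": "all-purpose-cluster-id"}
]
""")

# What the yaml file asks for (hypothetical helper mapping, written by hand).
EXPECTED = {
    "Task1": "existing_cluster_id",
    "Task2": "job_cluster_key",
    "Task3": "job_cluster_key",
}


def cluster_field(task):
    """Return which compute field a resolved task carries, or None."""
    for field in ("job_cluster_key", "existing_cluster_id"):
        if field in task:
            return field
    return None


def mismatched_tasks(tasks, expected):
    """List task keys whose resolved compute field differs from the yaml."""
    return [
        task["task_key"]
        for task in tasks
        if cluster_field(task) != expected.get(task["task_key"])
    ]


if __name__ == "__main__":
    for key in mismatched_tasks(RESOLVED_TASKS, EXPECTED):
        print(f"{key}: resolved config does not match the yaml")
```

With the output from this post, the script flags Task2 and Task3 and leaves Task1 alone, which matches what I am seeing.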

Thank you in advance!