Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Behaviour of cluster launches in multi-task jobs

Serhii
Contributor

We are adapting the multi-task workflow example from the dbx documentation for our pipelines: https://dbx.readthedocs.io/en/latest/examples/python_multitask_deployment_example.html. As part of the configuration we specify a cluster configuration and provide a job_cluster_key.

Question: it seems that when consecutive tasks within the workflow use the same cluster configuration, the cluster is not reused between tasks but created anew. Is there a way to configure the workflow so that the cluster is reused?

ACCEPTED SOLUTION

User16873043099
Contributor

Tasks within the same multi-task job can reuse clusters. A shared job cluster allows multiple tasks in the same job to use it: the cluster is created and started when the first task that uses it starts, and it terminates after the last task that uses it completes. Note that a shared job cluster is scoped to a single job run; it is not reused across separate runs of the job.

Reference: https://docs.databricks.com/workflows/jobs/jobs-api-updates.html

Sample API payload:

{
    "job_id": 123456789,
    "creator_user_name": "email@domain.com",
    "run_as_user_name": "email@domain.com",
    "run_as_owner": true,
    "settings": {
        "name": "MT job",
        "email_notifications": {
            "no_alert_for_skipped_runs": false
        },
        "timeout_seconds": 0,
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "task1",
                "notebook_task": {
                    "notebook_path": "/Users/email@domain.com/test",
                    "source": "WORKSPACE"
                },
                "job_cluster_key": "Shared_job_cluster",
                "timeout_seconds": 0,
                "email_notifications": {}
            },
            {
                "task_key": "task2",
                "depends_on": [
                    {
                        "task_key": "task1"
                    }
                ],
                "notebook_task": {
                    "notebook_path": "/Users/email@domain.com/test",
                    "source": "WORKSPACE"
                },
                "job_cluster_key": "Shared_job_cluster",
                "timeout_seconds": 0,
                "email_notifications": {}
            }
        ],
        "job_clusters": [
            {
                "job_cluster_key": "Shared_job_cluster",
                "new_cluster": {
                    "cluster_name": "",
                    "spark_version": "10.4.x-scala2.12",
                    "spark_conf": {
                        "spark.databricks.delta.preview.enabled": "true"
                    },
                    "azure_attributes": {
                        "first_on_demand": 1,
                        "availability": "ON_DEMAND_AZURE",
                        "spot_bid_max_price": -1
                    },
                    "node_type_id": "Standard_DS3_v2",
                    "spark_env_vars": {
                        "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
                    },
                    "enable_elastic_disk": true,
                    "runtime_engine": "STANDARD",
                    "num_workers": 1
                }
            }
        ],
        "format": "MULTI_TASK"
    },
    "created_time": 1660842831328
}
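
Since the question is about dbx: the same shared job cluster can be declared in the dbx deployment file, which follows the Jobs API 2.1 schema. The sketch below is a minimal, hypothetical conf/deployment.yml, not a definitive configuration — the environments/workflows layout assumes dbx 0.7+ (older dbx versions use a top-level environment name with a "jobs" list), and the workflow name, cluster key, and Python file paths are placeholders:

environments:
  default:
    workflows:
      - name: "example-multitask-workflow"
        job_clusters:
          - job_cluster_key: "Shared_job_cluster"
            new_cluster:
              spark_version: "10.4.x-scala2.12"
              node_type_id: "Standard_DS3_v2"
              num_workers: 1
        tasks:
          - task_key: "task1"
            # Both tasks reference the same job_cluster_key ...
            job_cluster_key: "Shared_job_cluster"
            spark_python_task:
              python_file: "file://some_package/task1.py"
          - task_key: "task2"
            depends_on:
              - task_key: "task1"
            # ... so the cluster is created once, before task1, and reused by task2.
            job_cluster_key: "Shared_job_cluster"
            spark_python_task:
              python_file: "file://some_package/task2.py"

After deploying with dbx deploy, a run of this workflow should start the cluster once for task1 and keep it running until task2 completes.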


