Multi Task Job creation through Pulumi

Borkadd
New Contributor II

I am trying to create a multi-task Databricks job on Azure with its own job cluster.

Although I was able to create a single-task job without any issues, the code to deploy the multi-task job fails with the following cluster validation error:

error: 1 error occurred:
        * cannot create job: Cluster validation error: Missing required field: settings.cluster_spec.new_cluster.size

The code to create the job is the following:

job = Job(
            resource_name = f"{job_name}-job",
            args=JobArgs(
                name = f"{job_name}-job",
                job_clusters=[
                    JobJobClusterArgs(
                        job_cluster_key="pulumiTest-basic-cluster",
                        new_cluster=JobJobClusterNewClusterArgs(
                            spark_version="13.3.x-scala2.12",
                            cluster_name="",
                            num_workers=0,
                            node_type_id="Standard_DS3_v2",
                            enable_elastic_disk=True,
                            runtime_engine="STANDARD",
                            spark_conf={
                                # f"fs.azure.account.key.{self.storage_account_name}.dfs.core.windows.net": "{{secrets/pulumiTest-secret-scope/puluTest-storage-access-token}}"
                                "spark.master": "local[*,4]",
                                "spark.databricks.cluster.profile": "singleNode"
                            },
                            custom_tags={
                                "ResourceClass": "SingleNode"
                            },
                            data_security_mode="LEGACY_SINGLE_USER_STANDARD"
                        )
                    )
                ],
                computes=[
                    JobComputeArgs(
                        compute_key="landing_task",
                        spec=JobComputeSpecArgs(kind="spark_python_task")
                    ),
                    JobComputeArgs(
                        compute_key="staging_task",
                        spec=JobComputeSpecArgs(kind="spark_python_task")
                    ),
                    JobComputeArgs(
                        compute_key="refined_task",
                        spec=JobComputeSpecArgs(kind="spark_python_task")
                    )
                ],
                tasks = [
                    JobTaskArgs(
                        task_key="landing_task",
                        job_cluster_key="pulumiTest-basic-cluster",
                        spark_python_task=JobSparkPythonTaskArgs(
                            python_file="/pipelineExample/landing.py",
                            source="GIT"
                        ),
                        run_if="ALL_SUCCESS",
                        libraries=[
                            JobLibraryArgs(
                                whl=whl_path
                            )
                        ]
                    ),
                    JobTaskArgs(
                        task_key="staging_task",
                        job_cluster_key="pulumiTest-basic-cluster",
                        spark_python_task=JobSparkPythonTaskArgs(
                            python_file="/pipelineExample/staging.py",
                            source="GIT"
                        ),
                        depends_ons=[
                            JobTaskDependsOnArgs(
                                task_key="landing_task"
                            )
                        ],
                        run_if="ALL_SUCCESS",
                        libraries=[
                            JobLibraryArgs(
                                whl=whl_path
                            )
                        ]
                    ),
                    JobTaskArgs(
                        task_key="refined_task",
                        job_cluster_key="pulumiTest-basic-cluster",
                        spark_python_task=JobSparkPythonTaskArgs(
                            python_file="/pipelineExample/refined.py",
                            source="GIT"
                        ),
                        depends_ons=[
                            JobTaskDependsOnArgs(
                                task_key="staging_task"
                            )
                        ],
                        run_if="ALL_SUCCESS",
                        libraries=[
                            JobLibraryArgs(
                                whl=whl_path
                            )
                        ]
                    )
                ],
                git_source=JobGitSourceArgs(
                    url=git_url,
                    provider="gitHub",
                    branch="main"
                ),
                format="MULTI_TASK"
            )
        )
pulumi.export('Job URL', job.url)

 Does anyone know where the problem could be?

1 REPLY

Borkadd
New Contributor II

Hello @Retired_mod, thanks for your answer, but the problem remains the same. I had already tested different cluster configurations, both single-node and multi-node, including the configurations that worked with single-task jobs, but the error does not change: it is always about the new cluster size.

According to the documentation here: https://www.pulumi.com/registry/packages/databricks/api-docs/job/#jobnewcluster, I understand that I need to set the cluster specification in the job_clusters parameter, not in a top-level new_cluster as with single-task jobs. A minimal sketch of that pattern is below.
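
This is only a sketch of what I take that page to describe, not a confirmed fix: the shared cluster spec sits under job_clusters with an explicit num_workers, and each task points at it through job_cluster_key. The node type, repo URL, file paths, and worker count are placeholders, and the exact Args class names may differ between provider versions.

import pulumi
from pulumi_databricks import (
    Job,
    JobArgs,
    JobGitSourceArgs,
    JobJobClusterArgs,
    JobJobClusterNewClusterArgs,
    JobSparkPythonTaskArgs,
    JobTaskArgs,
    JobTaskDependsOnArgs,
)

# Shared job cluster: the spec lives under job_clusters and carries an
# explicit size via num_workers, which is what the validation refers to.
shared_cluster = JobJobClusterArgs(
    job_cluster_key="shared-job-cluster",
    new_cluster=JobJobClusterNewClusterArgs(
        spark_version="13.3.x-scala2.12",
        node_type_id="Standard_DS3_v2",  # placeholder Azure node type
        num_workers=2,                   # explicit worker count (placeholder)
    ),
)

sketch_job = Job(
    resource_name="multi-task-sketch-job",
    args=JobArgs(
        name="multi-task-sketch-job",
        format="MULTI_TASK",
        job_clusters=[shared_cluster],
        git_source=JobGitSourceArgs(
            url="https://github.com/example/pipeline-example",  # placeholder repo
            provider="gitHub",
            branch="main",
        ),
        tasks=[
            # Tasks reference the shared cluster by key instead of defining
            # their own new_cluster block.
            JobTaskArgs(
                task_key="landing_task",
                job_cluster_key="shared-job-cluster",
                spark_python_task=JobSparkPythonTaskArgs(
                    python_file="pipelineExample/landing.py",  # placeholder path
                    source="GIT",
                ),
            ),
            JobTaskArgs(
                task_key="staging_task",
                job_cluster_key="shared-job-cluster",
                depends_ons=[JobTaskDependsOnArgs(task_key="landing_task")],
                spark_python_task=JobSparkPythonTaskArgs(
                    python_file="pipelineExample/staging.py",  # placeholder path
                    source="GIT",
                ),
            ),
        ],
    ),
)

pulumi.export("Job URL", sketch_job.url)

Compared with my code above, the only real differences are the explicit num_workers on the job cluster, and that I dropped the computes list and the single-node spark_conf/custom_tags, so the size check has an actual worker count to validate. I have not yet been able to confirm whether that alone makes the error go away.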
