
How to use custom whl file + pypi repo with a job cluster in asset bundles?

VicS
New Contributor II

I tried looking through the documentation but it is confusing at best and misses important parts at worst.  Is there any place where the entire syntax and ALL options for asset bundle YAMLs are described? 

I found this https://docs.databricks.com/en/dev-tools/bundles/index.html, but aside from rudimentary examples, it is incredibly difficult to figure out which key should go where in the YAML... Sometimes it was easier to read the Terraform documentation for Databricks jobs/tasks and work backwards from there than to read the Databricks documentation.

I cannot figure out how to use a job cluster with custom-built whl files. I build one whl file on the fly and pull another in as a dependency from a private PyPI server.

I was able to do it with a general-purpose cluster that I created beforehand, but the job cluster never seems to install the libraries.

With a general-purpose cluster:

- task_key: ingestion
  existing_cluster_id: ${var.existing_cluster_id}
  python_wheel_task:
    package_name: my_package
    entry_point: my_package_ep
    libraries:
      - whl: ./dist/*.whl
      - pypi:
          package: another-package==1.0.0
          repo: https://pkgs.dev.azure.com/xxxx/xx/_packaging/xxx/pypi/simple/

I tried various placements where I thought the "libraries" key might make sense, but due to the lack of documentation I was not able to figure it out. Neither below the task nor in the environment did it work.

      environments:
        - environment_key: myenv
          spec:
            client: "1"
            dependencies:
              - whl: ./dist/*.whl
              - pypi:
                  package: pyspark-framework==${var.pyspark_framework_version}
                  repo: https://pkgs.dev.azure.com/xxx/xxxx/_packaging/xxxx/pypi/simple/

      tasks:
        - task_key: mytask
          environment_key: myenv
          python_wheel_task:
            package_name: mypackage
            entry_point: mypackage_ep
            libraries:
              - whl: ./dist/*.whl
              - pypi:
                  package: pyspark-framework==${var.pyspark_framework_version}
                  repo: https://pkgs.dev.azure.com/xxx/xxxx/_packaging/xxxx/pypi/simple/

Can anyone tell me how to properly add my whl files (local dist + from PyPI) to a job cluster?

3 REPLIES

VicS
New Contributor II

I also tried the following, but couldn't get it to work with my custom pypi index. Any help is appreciated.

environments:
  - environment_key: myenv
    spec:
      client: "1"
      dependencies:
        - another-package==1.0.0@https://pkgs.dev.azure.com/xx/xxx/_packaging/xxx/pypi/simple/
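
A likely reason this form fails (an assumption on my part, not confirmed in this thread): pip treats the "name @ URL" syntax as a direct reference to an installable archive, not as an index URL, and combining it with a ==1.0.0 pin is not valid requirement syntax either. The index URL apparently has to be passed as its own entry, which is also what the reply below shows:

# Sketch only: the private index passed as a separate pip-style flag entry.
environments:
  - environment_key: myenv
    spec:
      client: "1"
      dependencies:
        - --index-url https://pkgs.dev.azure.com/xx/xxx/_packaging/xxx/pypi/simple/
        - another-package==1.0.0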

cgrant
Databricks Employee

With Classic compute, you'll want to modify your YAML to look like this (the key part is that libraries needs to be at the same level as existing_cluster_id, i.e. at the top level of the task definition):

- task_key: ingestion
  existing_cluster_id: ${var.existing_cluster_id}
  python_wheel_task:
    package_name: my_package
    entry_point: my_package_ep
  libraries:
    - whl: ./dist/*.whl
    - pypi:
        package: another-package==1.0.0
        repo: https://pkgs.dev.azure.com/xxxx/xx/_packaging/xxx/pypi/simple/
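
The original question was about a job cluster rather than an existing all-purpose cluster; the placement is the same there - libraries stays at the task level and the task points at a job_cluster_key instead of existing_cluster_id. A minimal sketch (cluster values are placeholders; a full working version is in the follow-up below):

# Sketch only - the job_cluster name, node type and runtime version are placeholders.
job_clusters:
  - job_cluster_key: job_cluster
    new_cluster:
      spark_version: 13.3.x-scala2.12
      node_type_id: Standard_D4ds_v5
      num_workers: 2

tasks:
  - task_key: ingestion
    job_cluster_key: job_cluster          # instead of existing_cluster_id
    python_wheel_task:
      package_name: my_package
      entry_point: my_package_ep
    libraries:                            # still at the task level
      - whl: ./dist/*.whl
      - pypi:
          package: another-package==1.0.0
          repo: https://pkgs.dev.azure.com/xxxx/xx/_packaging/xxx/pypi/simple/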

 With Serverless compute it looks slightly different - you'll need to specify an environment instead of a cluster, like this:

- task_key: ingestion
  python_wheel_task:
    package_name: my_package
    entry_point: my_package_ep
  environment_key: environment



environments:
  - environment_key: environment
    spec:
      client: "2"
      dependencies:
        - --index-url https://pkgs.dev.azure.com/xxxx/xx/_packaging/xxx/pypi/simple/
        - another-package==1.0.0

It is generally recommended to run databricks bundle validate before deploying - it helps catch bugs, surfacing error messages and warnings when the YAML isn't quite right.
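
Another option - assuming a reasonably recent Databricks CLI - is to generate the bundle JSON schema with databricks bundle schema and point a YAML language server at it; that also covers the earlier question about where every allowed key is documented, since the schema lists all of them. A minimal databricks.yml header, assuming the schema file name below and an editor that runs the YAML language server (e.g. the VS Code YAML extension):

# Assumes the schema was exported first with:
#   databricks bundle schema > bundle_config_schema.json
# yaml-language-server: $schema=bundle_config_schema.json

bundle:
  name: my_bundle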

VicS
New Contributor II

It took me a while to realize the distinction between the compute keys inside a task - so for anyone else looking into this: only one of the following keys can exist in a task definition:

      tasks:
        - task_key: ingestion_delta
          # existing_cluster_id: ${var.existing_cluster_id}       # All-purpose cluster
          job_cluster_key: job_cluster                            # Job cluster
          #environment_key: my_environment                        # serverless compute

What was also confusing for me is that the dependencies / libraries are listed under different keys and in different locations:
- For serverless compute, you need to use the "dependencies" key inside the "environment" object (a serverless counterpart sketch follows the full example below).
- For all-purpose and job clusters, you need to use the "libraries" key at the top level of the task definition.
 
So, for a job cluster, the full working example looks like this:

resources:
  jobs:
    customer_crm:
      name: customer_crm

      job_clusters:
        - job_cluster_key: job_cluster
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: Standard_D4ds_v5
            autoscale:
                min_workers: 1
                max_workers: 4
            custom_tags:
                some_tag: "my_tag"
            apply_policy_default_values: False
            data_security_mode: USER_ISOLATION  # SINGLE_USER (="Single User"), USER_ISOLATION (="Shared"), NONE (="No isolation shared")
            init_scripts:
              - workspace:
                  destination: "/Shared/my_init_script.sh"
            spark_env_vars:
              "PYPI_TOKEN": "{{secrets/<scope-name>/<secret-name>}}"

      tasks:
        - task_key: ingestion_delta
          # existing_cluster_id: ${var.existing_cluster_id}       # All-purpose cluster
          job_cluster_key: job_cluster                            # Job cluster
          #environment_key: my_environment                        # serverless compute
          python_wheel_task:
            package_name: customer_crm
            entry_point: customer_crm_ep_argparse
            named_parameters: { application-name: "myapp" }
          libraries:  # The order of the packages is important (!)
            - pypi:
                package: another-package==1.0.0
                repo: https://pkgs.dev.azure.com/xx/xxx/_packaging/xxx/pypi/simple/
            - whl: ./dist/*.whl
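
For comparison, the serverless counterpart of the same task (a sketch pieced together from the reply above, not something I have actually run) moves the dependencies into the environment spec and drops the libraries block. Whether a local ./dist/*.whl path works as a plain dependency entry is an assumption on my part - the reply above only shows the PyPI part:

# Serverless sketch - mirrors the reply above; the local-wheel entry is an untested assumption.
environments:
  - environment_key: my_environment
    spec:
      client: "2"
      dependencies:
        - --index-url https://pkgs.dev.azure.com/xx/xxx/_packaging/xxx/pypi/simple/
        - another-package==1.0.0
        - ./dist/*.whl

tasks:
  - task_key: ingestion_delta
    environment_key: my_environment       # serverless compute instead of job_cluster_key
    python_wheel_task:
      package_name: customer_crm
      entry_point: customer_crm_ep_argparse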
