Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do you define PyPI libraries at the job level in Asset Bundles?

jacovangelder
Honored Contributor

Hello,

The documentation does not say whether libraries can be defined at the job level instead of the task level. Putting libraries at the task level feels really counter-intuitive in Databricks workflows provisioned by Asset Bundles. Is there some other way to define libraries at the job level?

I tried the following:

tasks:
  - task_key: task1
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../foo.ipynb
  - task_key: task2
    depends_on:
      - task_key: task1
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../foo.ipynb
libraries:
- pypi:
    package: pyyaml==6.0.1
- pypi:
    package: requests==2.31.0
- pypi:
    package: typing_extensions==4.4.0

Validating the DAB does not fail, but it doesn't work either. It only works when I put the libraries object at the task level, which feels odd to me. Does that mean each task can have different libraries installed? The documentation doesn't shed much light on this. I could define all libraries on the first task, and I guess the second task would then inherit them, but that feels odd too.

I know that from DBR 15.x onwards we can use a requirements.txt file in the workspace, but I am on DBR 14.3 LTS.

I hope someone can shed some light on this.

2 REPLIES

Witold
Honored Contributor

This is actually how job clusters work: you specify dependent libraries at the task level.
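For reference, a minimal sketch of the task-level form, reusing the task keys, cluster key, and packages from the question. Only the placement of the `libraries` block differs from the original attempt; repeating it on each task is an assumption made here to keep every task explicit.

```yaml
tasks:
  - task_key: task1
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../foo.ipynb
    # Libraries are declared per task, not per job
    libraries:
      - pypi:
          package: pyyaml==6.0.1
      - pypi:
          package: requests==2.31.0
      - pypi:
          package: typing_extensions==4.4.0
  - task_key: task2
    depends_on:
      - task_key: task1
    job_cluster_key: job_cluster
    notebook_task:
      notebook_path: ../foo.ipynb
    # Repeated here so task2 does not rely on what task1 installed
    libraries:
      - pypi:
          package: pyyaml==6.0.1
      - pypi:
          package: requests==2.31.0
      - pypi:
          package: typing_extensions==4.4.0
```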

However, starting with Databricks CLI v0.222.0, you could try using complex variables for this kind of configuration.
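A sketch of how that might look, assuming Databricks CLI v0.222.0 or later. The variable name `job_libraries` and the job key `my_job` are hypothetical names chosen for illustration; the task keys, cluster key, and packages come from the question. The library list is declared once and referenced from each task.

```yaml
variables:
  # Hypothetical variable name; type "complex" allows a list or map as the value
  job_libraries:
    description: PyPI libraries shared by all tasks in the job
    type: complex
    default:
      - pypi:
          package: pyyaml==6.0.1
      - pypi:
          package: requests==2.31.0
      - pypi:
          package: typing_extensions==4.4.0

resources:
  jobs:
    my_job:  # hypothetical job key
      tasks:
        - task_key: task1
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../foo.ipynb
          # Still set at the task level, but defined only once above
          libraries: ${var.job_libraries}
        - task_key: task2
          depends_on:
            - task_key: task1
          job_cluster_key: job_cluster
          notebook_task:
            notebook_path: ../foo.ipynb
          libraries: ${var.job_libraries}
```

The libraries are still attached per task, so this does not change how job clusters work; it only removes the duplication in the bundle file.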

jacovangelder
Honored Contributor

Thanks @Witold! Thought so.

I decided to go with an init script where I install my dependencies rather than installing libraries. 

For future reference, this is what it looks like:

job_clusters:
  - job_cluster_key: job_cluster
    new_cluster:
      spark_version: ${var.spark_version}
      node_type_id: ${var.node_type_id}
      autoscale:
        min_workers: ${var.min_workers}
        max_workers: ${var.max_workers}
      data_security_mode: SINGLE_USER
      init_scripts:
        - workspace:
            destination: ${workspace.file_path}/resources/init-scripts/init-script.sh
