
wheel package to install in a serverless workflow

jeremy98
Contributor

Hi guys,
What is the way, with Databricks Asset Bundles, to declare a new job definition that uses serverless compute for each task of the workflow, so that inside each notebook task it is possible to import the dependent custom libraries I uploaded to the workspace?

I did something like this:

      environments:
      - environment_key: envir
        spec:
          client: "1"
          dependencies:
            - "${workspace.root_path}/artifacts/.internal/data_pipelines-0.0.1-py3-none-any.whl"

      tasks:

        - task_key: schedule_next_run_for_this_job
          description: due to business requirements is needed to reschedule the workflow in the near next run
          environment_key: envir
          notebook_task:
            notebook_path: ../notebook/jobs/export.py
            base_parameters:
              function: schedule_next_run_for_this_job
              env: ${bundle.target}
              job_id: "{{job.id}}"
              workspace_url: "{{workspace.url}}"

but it returns:

Error: cannot create job: A task environment can not be provided for notebook task get_email_infos. Please use the %pip magic command to install notebook-scoped Python libraries and Python wheel packages


Is the only way to use a personal wheel package on serverless compute to install that library inside the notebook?

Because I would like to do something like:

libraries:
   - whl: ...
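
For reference, this is roughly the per-task pattern I mean, as it works for jobs on classic (non-serverless) compute, reusing the same wheel path as in the environments block above:

      tasks:
        - task_key: schedule_next_run_for_this_job
          notebook_task:
            notebook_path: ../notebook/jobs/export.py
          libraries:
            - whl: ${workspace.root_path}/artifacts/.internal/data_pipelines-0.0.1-py3-none-any.whl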

 


Alberto_Umana
Databricks Employee

Hi @jeremy98,

It appears that you are trying to use an environment block to specify dependencies for a notebook task, but this approach is not supported for notebook tasks on serverless compute. Instead, you should use the %pip magic command within the notebook to install the required libraries.

 

  • Create the job definition with the necessary tasks. Each task should specify the notebook path and any parameters required.
  • Use the %pip magic command inside each notebook to install the custom libraries. This ensures that the libraries are available in the notebook's environment when the task runs.

 

Here’s an example:

 

bundle:
  name: my-bundle

resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: schedule_next_run_for_this_job
          description: due to business requirements is needed to reschedule the workflow in the near next run
          notebook_task:
            notebook_path: /Workspace/Users/your_username/notebook/jobs/export.py
            base_parameters:
              function: schedule_next_run_for_this_job
              env: ${bundle.target}
              job_id: "{{job.id}}"
              workspace_url: "{{workspace.url}}"

targets:
  dev:
    default: true
    resources:
      jobs:
        my-job:
          name: my-job

 

The example content of export.py:

 

 

# Install custom libraries using the %pip magic command
%pip install /Workspace/Shared/Path/To/your_custom_library.whl

# Your notebook code here
def schedule_next_run_for_this_job():
    # Function implementation
    pass

# Call the function
schedule_next_run_for_this_job()

Hi,
Thanks for this answer! But should any code from the wheel package then be imported like this, for example?

 

 

from data_pipelines.core.utils.filters import (
    filter_by_time_granularity
)

 

 

Alberto_Umana
Databricks Employee

Yes, you can import code from a wheel package in your notebook just like you would with any other Python module. Once you have installed the wheel package using %pip, you can import the functions or classes from the package.

For example, if your wheel package contains a module data_pipelines.core.utils.filters and you want to import the filter_by_time_granularity function, you can do it as follows:

%pip install /Workspace/Shared/Path/To/your_custom_library.whl

from data_pipelines.core.utils.filters import filter_by_time_granularity

Hi, mmm ok, but how do I upload the wheel package on every DAB deploy? Because I did it in this way:

artifacts:
  lib:
    type: whl
    build: poetry build
    path: .

sync:
  include:
    - ./dist/*.whl

But this will deploy the wheel package to my personal root_path:

  stg:
    default: true
    workspace: 
      host: <host-id>
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}

How can I specify that the wheel package should be uploaded to a shared location every time?

And another question: when installing the Python wheel on serverless compute, is it also possible to specify the Python version of the serverless compute? Because I tried to do it, but it says: ERROR: Package 'data-pipelines' requires a different Python: 3.10.12 not in '<4.0,>=3.11'

Alberto_Umana
Databricks Employee

You can define the artifact_path in the workspace mapping:

This path should be a shared location accessible by all users who need to use the wheel package.

 

bundle:
  name: my-bundle

artifacts:
  lib:
    type: whl
    build: poetry build
    path: .

sync:
  include:
    - ./dist/*.whl

workspace:
  artifact_path: /Workspace/Shared/Path/To/Shared/Location/.bundle/${bundle.name}/${bundle.target}

targets:
  stg:
    default: true
    workspace:
      host: <host-id>
      root_path: /Workspace/Shared/Path/To/Shared/Location/.bundle/${bundle.name}/${bundle.target}

artifact_path: This specifies the path where the artifacts (wheel packages) will be stored in the workspace. By setting it to a shared location, you ensure that the wheel package is accessible to all users.
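
For example, once the bundle is deployed with that artifact_path, the notebook can point %pip at the shared location. A minimal sketch, assuming the default DAB artifact layout (a .internal subfolder under artifact_path) and the wheel name used earlier in this thread; for a bundle named my-bundle and target stg the resolved path would look roughly like:

# Hypothetical resolved path; the exact layout under artifact_path may differ in your deployment
%pip install /Workspace/Shared/Path/To/Shared/Location/.bundle/my-bundle/stg/.internal/data_pipelines-0.0.1-py3-none-any.whl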

Alberto_Umana
Databricks Employee

About your second question: it is not possible to specify the Python version directly when installing a Python wheel. The serverless runtime comes with a built-in Python version, and upgrading or downgrading it could break the system due to dependencies.
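
If you want to double-check which Python version a given serverless environment is actually running, a quick check from a notebook cell (standard library only) is:

import sys

# Print the interpreter version of the current serverless environment,
# e.g. 3.10.x or 3.11.x depending on the environment version
print(sys.version)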

Hi, thanks for your answers, really helpful. But does this mean I should find a way to downgrade the Python version specified in my pyproject.toml (and make all of my dependencies match it), in order to be able to run the package on any serverless cluster?

Because I don't know which Python version I will get every time, right?

Alberto_Umana
Databricks Employee

Hi, no problem! Serverless will use the latest DBR version mentioned here: https://docs.databricks.com/en/release-notes/serverless/index.html#version-154, and the Python version follows from that. In this case it is DBR 15.4 LTS, which uses:

  • Python: 3.11.0

So we need to refactor any dependencies to be compatible with that Python version, and keep checking whether any release update comes with a different DBR/Python version.

Hi Alberto, thanks for the answer again, but I don't understand your point. You said the current cluster also works with Python 3.11, but it seems that when I got a new serverless cluster it didn't have Python 3.11 but a lower version. What do I need to do?

Alberto_Umana
Databricks Employee

Hey Jeremy, serverless should be using 3.11 too, do you see a different version? Serverless should pick DBR version 15.4, which uses 3.11, based on https://docs.databricks.com/en/release-notes/serverless/index.html#version-154

Alberto_Umana
Databricks Employee

Oh, I see the error above, Python: 3.10.12 not in '<4.0,>=3.11'. I just tested it and it is indeed using 3.10, let me check.

Alberto_Umana
Databricks Employee

I see the reason now: there are two versions of the serverless environment. Version 1 uses Python 3.10.12 and version 2 uses 3.11, please see: https://docs.databricks.com/en/release-notes/serverless/client-two.html


 

Hi,
Thanks again for the answer :). Ok, but do I need to declare the environment field as I did before? Consider that I'm using DABs.

Like this?

 

      environments: 
        - environment_key: env_for_data_pipelines_whl
          spec: 
            client: "2"

edit: I defined it before the tasks, so that each task inherits the specified environment client, but it isn't applied... I still have the same problem

 
