Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Wheel package to install in a serverless workflow

jeremy98
Contributor III

Hi guys, 
What is the way, through Databricks Asset Bundles, to declare a new job definition with serverless compute on each task of the workflow, so that each notebook task can pick up the custom dependency libraries (wheels) I imported into the workspace?

I did something like this:

      environments:
      - environment_key: envir
        spec:
          client: "1"
          dependencies:
            - "${workspace.root_path}/artifacts/.internal/data_pipelines-0.0.1-py3-none-any.whl"

      tasks:

        - task_key: schedule_next_run_for_this_job
          description: due to business requirements, the workflow needs to reschedule itself for the next run
          environment_key: envir
          notebook_task:
            notebook_path: ../notebook/jobs/export.py
            base_parameters:
              function: schedule_next_run_for_this_job
              env: ${bundle.target}
              job_id: "{{job.id}}"
              workspace_url: "{{workspace.url}}"

but it returns this error:

Error: cannot create job: A task environment can not be provided for notebook task get_email_infos. Please use the %pip magic command to install notebook-scoped Python libraries and Python wheel packages


Is the only way to use a personal wheel package on serverless compute to install that library inside the notebook itself?

Because I want to do something like this instead:

libraries:
   - whl: ...
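
For context, a task-level libraries block like that is what jobs on classic (non-serverless) compute accept. A minimal sketch, assuming a job cluster declared elsewhere in the bundle under the hypothetical key my_cluster; serverless notebook tasks reject this, as the error above shows:

      tasks:
        - task_key: schedule_next_run_for_this_job
          job_cluster_key: my_cluster  # hypothetical classic job cluster defined under job_clusters
          notebook_task:
            notebook_path: ../notebook/jobs/export.py
          libraries:
            - whl: "${workspace.root_path}/artifacts/.internal/data_pipelines-0.0.1-py3-none-any.whl"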

 

22 REPLIES

jeremy98
Contributor III

Hello @Alberto_Umana, consider that outside the workflow I can install the library; when I run the workflow through DABs I still get this error:

 

CalledProcessError: Command 'pip --disable-pip-version-check install '/Workspace/Shared/test-sync-lib/.internal/data_pipelines-0.0.1-py3-none-any.whl'' returned non-zero exit status 1.

 

and looking at the error in detail, it shows:

 

ERROR: Package 'data-pipelines' requires a different Python: 3.10.12 not in '<4.0,>=3.11'

 

But that sounds strange, since I declared that environment field, which in theory should be inherited by each task automatically.

Alberto_Umana
Databricks Employee

Hi @jeremy98,

I think it has to do with the serverless version used outside the workflow versus in DABs, since the Python version changes. Please see: https://docs.databricks.com/en/release-notes/serverless/index.html. The two versions have different Python versions, which might cause dependency issues. I am not sure how to specify the serverless version in DABs; I will check internally.
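
If it helps, the serverless environment version for a task environment appears to be selected through the client field of the environment spec, so bumping it may pull a newer Python. A minimal sketch, assuming environment version "2" ships Python 3.11 (worth double-checking against the serverless release notes above):

      environments:
        - environment_key: envir
          spec:
            client: "2"  # serverless environment version; assumed here to bundle Python 3.11
            dependencies:
              - "${workspace.root_path}/artifacts/.internal/data_pipelines-0.0.1-py3-none-any.whl"

Alternatively, relaxing the wheel's requires-python constraint to allow Python 3.10 would also avoid the mismatch.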

Good morning,
Thanks for the answer. Yes, please let me know, because I found how to declare it at the higher level, but it still seems that the environment is not picked up inside each task. If I look at the task structure there, the environment is set, but it doesn't work.

jeremy98
Contributor III

@Alberto_Umana, one of my colleagues did it using a spark_python_task... maybe this only works for certain types of files?

Alberto_Umana
Databricks Employee

Hi @jeremy98,

When you mentioned using a spark_python_task, did it work with serverless too?

Hello,

 

 

      tasks:
        - task_key: batch_inference
          description: "Trigger batch inference processing"
          spark_python_task:
            python_file: "../py_scripts/infge.py"  # Path to your Python script
            parameters: ["--function", "run_batch_workflow", "--env", "${bundle.target}"]
          environment_key: default  # Reference the environment specification
          timeout_seconds: 6000  # 100 minutes timeout for the task

      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - azure-batch==14.2.0
              - azure-identity==1.19.0
              - azure-keyvault-secrets==4.9.0

 

 

 


He did it this way, but the libraries he needs are only the ones you see under dependencies, so it's different from my case.
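
Combining the two snippets in this thread, a hedged sketch of what the wheel-based setup from the original question might look like with a spark_python_task instead of a notebook_task (the py_scripts path and the parameters below are illustrative placeholders):

      environments:
        - environment_key: envir
          spec:
            client: "1"
            dependencies:
              - "${workspace.root_path}/artifacts/.internal/data_pipelines-0.0.1-py3-none-any.whl"

      tasks:
        - task_key: schedule_next_run_for_this_job
          spark_python_task:
            python_file: ../py_scripts/export.py  # plain .py entry point instead of a notebook
            parameters: ["--function", "schedule_next_run_for_this_job", "--env", "${bundle.target}"]
          environment_key: envir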

Is it also true that the installation of these libraries happens only once, even though we are in serverless mode?

Hello @Alberto_Umana, any news?

jeremy98
Contributor III

Ping @Alberto_Umana 
