Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Bundled Wheel Task with Serverless Compute

mlrichmond-mill
New Contributor III

I am trying to run a wheel task as part of a bundle on serverless compute. My databricks.yml includes an artifact being constructed:

artifacts:
  nexusbricks:
    type: whl
    build: python -m build
    path: .

I am then trying to set up a job to consume it:

resources:
  jobs:         
    ingest_file:
      name: "Ingest File Data"

      parameters:
        - name: scriptName        
          default: ""

      tasks:
        - task_key: ingest_file
          environment_key: serverless_env
          python_wheel_task:
            package_name: "nexusbricks"
            entry_point: "ingestRouter"            
            parameters:
              - "--scriptName"
              - "{{job.parameters.scriptName}}"   # sys.argv[1]

      environments:
        - environment_key: serverless_env
          spec:
            environment_version: "2" 
            dependencies:
            - "${workspace.root_path}/artifacts/.internal/nexusbricks-0.1.0-py3-none-any.whl"
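For context, the entry point being invoked is a plain console-script function. A minimal sketch of what ingestRouter might look like (the body here is illustrative, not my actual code; the task passes `parameters` as argv):

```python
# Illustrative sketch of the console-script entry point named by
# entry_point: "ingestRouter", assumed to be declared under
# [project.scripts] in pyproject.toml. The wheel task passes
# `parameters` ("--scriptName", "<value>") as command-line arguments.
import argparse
import sys


def ingestRouter(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--scriptName", default="")
    args = parser.parse_args(sys.argv[1:] if argv is None else argv)
    print(f"routing ingest for script: {args.scriptName}")
    return args.scriptName


if __name__ == "__main__":
    ingestRouter()
```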

 

Eventually, I would like to be able to upload the wheel to a dynamic location based on its git_commit and then use any past version of the wheel in my job - but for right now I am just trying to get this simple example to work.

I know I can make this work using the libraries tag when using a traditional job cluster - but how do I get this to work for serverless?

I haven't found an example of this workflow on the forums so far - the closest I saw was someone using notebooks but notebook tasks that consume a wheel aren't the same thing as a wheel task.

1 ACCEPTED SOLUTION

Accepted Solutions

mlrichmond-mill
New Contributor III

Following up further, I went back to the "simple" case - but it still doesn't work. I can see in the logs that my library is found and loaded:

Uninstalling nexusbricks-0.1.0:
Successfully uninstalled nexusbricks-0.1.0
Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.
Processing /Volumes/almdev/transient/staging/artifacts/mark.richmond@milliman.com/DBX/dev/4f8943f6460ba7cd2ad50c82e50f2bebf1cf02db/.internal/nexusbricks-0.1.0-py3-none-any.whl (from -r /tmp/tmp-6746852034be4303a9151e05775bf74d-environment-requirements.txt (line 1))
Requirement already satisfied: pyspark in /local_disk0/.ephemeral_nfs/envs/pythonEnv-35cd5d82-6a03-448d-b711-7f1f603acfb9/lib/python3.11/site-packages (from nexusbricks==0.1.0->-r /tmp/tmp-6746852034be4303a9151e05775bf74d-environment-requirements.txt (line 1)) (4.1.1)
Requirement already satisfied: py4j<0.10.9.10,>=0.10.9.7 in /databricks/python3/lib/python3.11/site-packages (from pyspark->nexusbricks==0.1.0->-r /tmp/tmp-6746852034be4303a9151e05775bf74d-environment-requirements.txt (line 1)) (0.10.9.7)
Installing collected packages: nexusbricks
Successfully installed nexusbricks-0.1.0

However the wheel task still fails:

Run failed with error message
 Python wheel with name nexusbricks could not be found. Please check the driver logs for more details

After a lot of back-and-forth with the Databricks agent bot, it seems to imply that wheel tasks simply *don't* work on serverless: it insists you must use libraries, which requires a cluster, as we previously established.

I've gone through dozens of permutations at this point and am unable to get a wheel task to run on serverless at all.

The issue above turned out to be that environment_version: "2" apparently does not support loading the wheel correctly. Changing to environment_version: "4" resolved it. When you have time, I'd still appreciate answers re: artifact_path and best practices for dev/prod.
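For anyone landing here later, the environment block that ended up working is the same as before, with only the version bumped:

```yaml
environments:
  - environment_key: serverless_env
    spec:
      environment_version: "4"   # "2" failed to locate the wheel for the task
      dependencies:
        - "${workspace.root_path}/artifacts/.internal/nexusbricks-0.1.0-py3-none-any.whl"
```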

Thanks.


4 REPLIES

Louis_Frolio
Databricks Employee

Hey @mlrichmond-mill , for serverless, you install your wheel via the job’s serverless environment dependencies — not the libraries stanza. Point the dependency at an absolute /Workspace or /Volumes path where the bundle uploaded the wheel, then run it using package_name + entry_point exactly as you’re doing now.

One gotcha: serverless caches environments, so if you keep the same wheel version or overwrite the wheel at the same path, your changes may not be picked up. Bump the version or change the path on every build.

Why your current setup isn’t running yet

  • Serverless tasks (Python wheel, Python script, dbt) require an environment_key and resolve dependencies from that environment’s spec.

    The libraries field you’d use with a cluster is ignored by serverless.

  • After bundle deploy, the wheel is uploaded into your workspace at:

    /Workspace/Users/<you>/.bundle/<bundle>/<target>/artifacts/.internal/<wheel>.whl

    You must reference that absolute path in the environment dependencies.

Minimal working example (serverless + wheel)

 

The key idea: reference the wheel from the environment, using its absolute /Workspace path.

# databricks.yml
bundle:
  name: nexusbricks-bundle

artifacts:
  nexusbricks:
    type: whl
    build: python -m build
    path: .
    # Optional, see caching notes below
    # dynamic_version: true

resources:
  jobs:
    ingest_file:
      name: "Ingest File Data"
      parameters:
        - name: scriptName
          default: ""

      tasks:
        - task_key: ingest_file
          environment_key: serverless_env   # required for serverless
          python_wheel_task:
            package_name: "nexusbricks"
            entry_point: "ingestRouter"
            parameters:
              - "--scriptName"
              - "{{job.parameters.scriptName}}"

      environments:
        - environment_key: serverless_env
          spec:
            environment_version: "2"
            dependencies:
              # Absolute workspace path to the deployed wheel
              # ${workspace.root_path} typically resolves to:
              # /Workspace/Users/<you>/.bundle/<bundle>/<target>
              - "/Workspace${workspace.root_path}/artifacts/.internal/nexusbricks-0.1.0-py3-none-any.whl"

Key points:

  • environment_key is mandatory for serverless tasks.

  • Dependencies accept pip-style specs, including absolute paths starting with /Workspace or /Volumes.

  • Bundle deploys place wheels under …/.bundle/<bundle>/<target>/artifacts/.internal/.

Run flow:

databricks bundle validate
databricks bundle deploy -t <target>
databricks bundle run -t <target> ingest_file

Caching and versioning (this matters)

Serverless caches environment dependencies. If you reuse the same wheel version or path, you may still be running yesterday’s code.

Best practices:

  • Bump the wheel version on every build (timestamp or git SHA), or

  • Change the artifact path on each deploy so serverless sees a “new” dependency.

You can automate this in bundles by:

  • Using dynamic_version on the artifact, or

  • Embedding the git commit into the artifact path and version.

Long-term pattern: versioned wheels in a stable location

If you want reproducibility and easy rollbacks, store wheels in a Unity Catalog Volume, versioned by commit or semver.

Example pattern:

  • Artifact path includes ${bundle.git.commit}

  • Wheel version includes commit or semver

  • Jobs pin to an exact wheel

 

workspace:
  artifact_path: /Volumes/main/shared/artifacts/${bundle.name}/${bundle.target}/${bundle.git.commit}

artifacts:
  nexusbricks:
    type: whl
    build: |
      python -c "
      import re
      from pathlib import Path
      py = Path('pyproject.toml')
      s = py.read_text()
      s = re.sub(r'version\s*=\s*\"([^\"]+)\"',
                 lambda m: f'version = \"{m.group(1)}+${bundle.git.commit}\"', s)
      py.write_text(s)
      "
      python -m build
    path: .

resources:
  jobs:
    ingest_file:
      environments:
        - environment_key: serverless_env
          spec:
            environment_version: "2"
            dependencies:
              - "/Volumes/main/shared/artifacts/${bundle.name}/${bundle.target}/${bundle.git.commit}/nexusbricks-0.1.0+${bundle.git.commit}-py3-none-any.whl"

This avoids cache surprises and gives you deterministic rollbacks.
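The version-rewrite step in the build command above can also be expressed as a small pure function, which is easier to test locally (stamp_version is a hypothetical helper name, not a bundles API):

```python
# Sketch: append a commit sha to the version in pyproject.toml text,
# producing a PEP 440 local version such as "0.1.0+4f8943f".
# stamp_version is a made-up helper name for illustration.
import re


def stamp_version(pyproject_text: str, commit: str) -> str:
    return re.sub(
        r'version\s*=\s*"([^"]+)"',
        lambda m: f'version = "{m.group(1)}+{commit}"',
        pyproject_text,
    )
```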

Alternative: install the project directory

Serverless environments can also pip install directly from a project directory (with pyproject.toml or setup.py) stored in Workspace files or a Volume. This is fine for fast iteration, but versioned wheels are the better long-term pattern for jobs.
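For fast iteration that might look something like this (a sketch; the path here is made up, and the directory must contain a pyproject.toml or setup.py):

```yaml
# Illustrative only: point the environment dependency at the synced
# project directory itself rather than at a built wheel.
environments:
  - environment_key: serverless_env
    spec:
      environment_version: "2"
      dependencies:
        - "/Workspace/Users/<you>/.bundle/nexusbricks-bundle/dev/files"
```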

Quick checklist

  • Dependency path is absolute and starts with /Workspace or /Volumes

  • environment_key is set and used by the task

  • package_name and entry_point match your package metadata

  • Wheel version or path changes on every deploy

  • Don’t rely on libraries for serverless wheel installs

 

Hope this helps put you in the right direction.

Cheers, Louis.

mlrichmond-mill
New Contributor III

Based on your example, I think my code was the same as your "minimal working example" - however, I suspect I have a stale wheel cached in the serverless setup, and that may be the source of my problem there. I agree the long-term goal is the git-commit-based version, but when I attempt to use what you supplied, the deploy fails on the artifact path (Error: target with 'mode: production' must set 'workspace.root_path' to make sure only one copy is deployed) because I'm still working in dev at the moment.

What is the best practice to facilitate this kind of pattern while letting prod and dev peacefully coexist?
Ideally, I'd like to deploy to dev and have it automatically pick up my current "version" via a timestamp or whatever else is necessary, and have prod use a wheel based on a git commit.
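To make the question concrete, I'm imagining something along these lines (a sketch only; target names and the prod path are hypothetical):

```yaml
# Hypothetical sketch of the dev/prod split I'm after.
targets:
  dev:
    mode: development   # per-user root_path; dev wheels could be timestamped
  prod:
    mode: production
    workspace:
      root_path: /Workspace/Shared/.bundle/${bundle.name}/${bundle.target}/${bundle.git.commit}
```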


saurabh18cs
Honored Contributor III

Hi @mlrichmond-mill, this is how we do it:

 
      environments:
        - environment_key: serverless_env_v4
          spec:
            environment_version: '4'
            dependencies:
              - dist/*.whl
because your bundle's wheel gets placed into dist/*.whl; the bundle uploader syncs dist/*.whl into workspace files behind the scenes.
 
Note: serverless caches the last used version, so it will not always pick up the latest build. It's better to use a pinned version for serverless.
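For example, pinning the exact wheel filename instead of using a glob (filename illustrative; bump it on every build):

```yaml
dependencies:
  - dist/nexusbricks-0.1.1-py3-none-any.whl   # exact, versioned filename
```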
 
Br