mark_ott
Databricks Employee

Installing Python packages on Databricks serverless compute via asset bundles is possible, but there are limitations and required configuration adjustments compared to traditional jobs or job tasks. The two core approaches for serverless workloads are declaring dependencies in the bundle's environments section or packaging them as Python wheel files.

Key Findings

  • Asset Bundles and Environments: To add third-party libraries to a serverless DLT pipeline, declare them in the environments section of your asset bundle definition. However, simply listing the dependencies in the environment block isn't enough; the task must explicitly reference that environment. Without this reference, your custom or external packages are not installed at runtime.

  • Linking Environment to Task: An environment defined under environments must be attached to your pipeline/job task via its environment_key. This is what makes the task pull in the dependencies you listed.

  • Supported Package Types: Installation via asset bundles is most predictable when you package dependencies as Python wheel files (.whl) and list them in the environment's dependencies property. PyPI packages can also be listed there, but pip-installing directly from PyPI may not work as seamlessly on serverless compute as it does on standard clusters.

  • Manual Install Still Works: You can still install packages at runtime in a notebook using %pip install ..., but this sacrifices the automation and reproducibility that asset bundles provide.

  • Limitations: JAR/Maven packages and direct custom data source connections are not supported on serverless compute; library support is Python-centric.
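Because a missing environment_key link only surfaces at runtime as an import failure, a small guard at the top of your pipeline code can fail fast with a clearer message. A minimal sketch (the require helper and its error text are illustrative, not a Databricks API):

```python
import importlib.util


def require(package: str) -> None:
    """Raise a descriptive error if a declared dependency is missing."""
    if importlib.util.find_spec(package) is None:
        raise ModuleNotFoundError(
            f"'{package}' is not installed. Check that the task's "
            "environment_key references an environment that lists it "
            "under spec.dependencies."
        )


# Verify a dependency before using it; in practice you would pass
# e.g. "pandera" here. "json" is stdlib, so this check always passes.
require("json")
```

A check like this turns a vague downstream ImportError into an immediate, actionable failure at pipeline start.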

Recommended Solution

Update your job/task configuration as follows:

text
environments:
  - environment_key: default
    spec:
      client: "1"
      dependencies:
        - pandera

resources:
  jobs:
    xyz:
      name: x_y_z
      tasks:
        - task_key: PipelineTask
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
          environment_key: default # <-- Link environment here

This binding ensures the default environment (which lists pandera as a dependency) is actually used when the pipeline runs.
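After updating the configuration, redeploy the bundle so the change takes effect. A typical CLI sequence (the target name dev and the job key xyz follow the example above; substitute your own):

```shell
# Check the bundle configuration for errors before deploying
databricks bundle validate

# Deploy the updated job and environment definitions to the workspace
databricks bundle deploy -t dev

# Trigger the job to confirm the dependency is installed at runtime
databricks bundle run xyz -t dev
```

These commands require an authenticated Databricks CLI; they are shown here as a workflow sketch rather than a copy-paste recipe.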

Alternative (Wheel Packaging)

If you have more complex dependencies or custom code, pre-package your dependencies (or your code and dependencies) as a wheel file and reference them in your bundle, which is well-supported and robust:

text
environments:
  - environment_key: myenv
    spec:
      dependencies:
        - dist/my_package-0.1.0-py3-none-any.whl
# Reference the environment_key in the task as shown above.
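To produce a wheel like the one referenced here, a minimal pyproject.toml is enough (the package name, version, and pandera dependency are placeholders chosen to match the example filename):

```toml
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "my_package"
version = "0.1.0"
dependencies = ["pandera"]
```

Running `python -m build` (from the PyPI `build` package) then emits `dist/my_package-0.1.0-py3-none-any.whl`, which you can commit or generate in CI before deploying the bundle.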

Summary Table

Installation Approach           Works on Serverless?   Notes
pip in notebook                 Yes                     Manual, not reproducible
Asset bundle, env not linked    No                      Must link environment_key
Asset bundle with wheel file    Yes                     Best for custom code
Asset bundle w/ PyPI in env     Yes (if linked)         Use dependencies block
JAR/Maven dependencies          No                      Not supported

For best results, package your dependencies in a wheel, reference it in your bundle environment, and always link the environment_key in your job/task definition. If your use case is still not supported, fall back to a manual %pip install in a notebook, or check the latest Databricks documentation on serverless package management.
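The notebook fallback mentioned above is a single magic cell; restarting the Python process afterward makes the newly installed package importable in subsequent cells:

```
%pip install pandera
dbutils.library.restartPython()
```

This is a Databricks notebook fragment (%pip and dbutils are notebook-only), so it cannot be captured in a bundle and must be re-run per notebook.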