Installing Python packages on Databricks serverless compute via asset bundles is possible, but it comes with some unique limitations and configuration adjustments compared to traditional jobs and job tasks. The two main approaches for serverless workloads are declaring dependencies in the asset bundle’s environments section or packaging them as Python wheel files.
Key Findings
- Asset Bundles and Environments: To add third-party libraries to serverless DLT pipelines, use the environments section within your asset bundle definition. Simply specifying the dependencies in the environment block isn’t enough; you must also explicitly reference the environment from the task itself. Without this reference, your custom or external packages are not installed at runtime.
- Linking Environment to Task: The environment defined under environments must be linked in your pipeline/job task via environment_key. This is what makes the pipeline actually pull in the dependencies you listed.
- Supported Package Types: Installing packages via asset bundles is most predictable when you package dependencies as Python wheel files (.whl) and list them in the environment’s dependencies property. Support for pip/conda-style installation varies, and pip-installing directly from PyPI within the configuration may not work as seamlessly on serverless compute as on standard clusters.
- Manual Install Still Works: You can still install packages at runtime in a notebook using %pip install ... (see the sketch after this list), but this gives up the automation and reproducibility that asset bundles provide.
- Limitations: JAR/Maven packages and direct custom data source connections are not supported on serverless compute; support is Python-centric.
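As a point of reference for the manual fallback above, here is a minimal sketch of what that looks like in a Databricks notebook cell, using pandera (the example package from the configuration below) with the understanding that the install lives outside the bundle:

# Manual fallback in a notebook cell: installed for this session only,
# not managed or reproduced by the asset bundle. Pin a version in real use.
%pip install pandera

# Restart the Python process so the freshly installed package can be imported.
dbutils.library.restartPython()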
Recommended Solution
Update your job/task configuration as follows:
resources:
  jobs:
    xyz:
      name: x_y_z
      # Serverless environments are declared on the job itself.
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - pandera
      tasks:
        - task_key: PipelineTask
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
          environment_key: default # <-- Link environment here
This binding ensures the default environment (which lists pandera as a dependency) is actually used when the pipeline runs.
Alternative (Wheel Packaging)
If you have more complex dependencies or custom code, pre-package your code and its dependencies as a wheel file and reference it in your bundle environment; this path is well supported and robust (a build sketch follows the snippet):
environments:
  - environment_key: myenv
    spec:
      client: "1"
      dependencies:
        - dist/my_package-0.1.0-py3-none-any.whl
# Place this under the job resource and reference the environment_key (myenv) in the task, as shown above.
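For completeness, one way to produce that wheel, assuming a standard pyproject.toml-based project (the package name and dist/ path are the illustrative ones above):

# Build a wheel into dist/ from a pyproject.toml-based project (illustrative layout).
pip install --upgrade build
python -m build --wheel
# Yields something like dist/my_package-0.1.0-py3-none-any.whl,
# i.e. the path referenced in the environment's dependencies above.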
Summary Table
| Installation Approach | Works on Serverless? | Notes |
|---|---|---|
| %pip in notebook | Yes | Manual, not reproducible |
| Asset bundle, env not linked | No | Must link environment_key |
| Asset bundle with wheel file | Yes | Best for custom code |
| Asset bundle w/ PyPI in env | Yes (if linked) | Use dependencies block |
| JAR/Maven dependencies | No | Not supported |
For best results, package dependencies in a wheel, reference it in your bundle environment, and always link the environment_key in your job/task definition. If your use case is still not supported, fall back to a manual %pip install in a notebook or check the latest Databricks documentation on serverless package management.
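As a closing sketch, a typical validate/deploy/run sequence with the Databricks CLI, using the job key from the example above (xyz); adjust targets and profiles to your setup:

# Check that the bundle configuration, including the environments block, is valid.
databricks bundle validate

# Deploy the bundle (uploads the wheel and the job/pipeline definitions).
databricks bundle deploy

# Trigger the job defined above; "xyz" is the resource key from the example configuration.
databricks bundle run xyz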