Is it possible to have a cluster with pre-installed dependencies?
04-25-2023 07:54 AM
I run some jobs in the Databricks environment where some resources require authentication. I handle this (and need to) through the vault-cli in the init script.
However, the init script has to install vault-cli and other libraries every time. Is there any way to have them pre-installed? I would like to avoid this installation on every job run.
- Labels: Cluster
04-26-2023 09:58 PM
@João Victor Albuquerque:
Yes, there are a few ways to pre-install libraries and tools in the Databricks environment:
- Cluster-scoped init scripts: You can specify a shell script to be run when a cluster is created or restarted. This script can include commands to install libraries and tools using package managers like pip or apt-get. Note that the installation still runs on every cluster start; the packages are simply in place before your job begins.
- Databricks environments: You can create a Databricks environment that includes the required libraries and tools. An environment is a versioned set of libraries, and you can specify the environment to use when creating or starting a cluster. This way, every time a cluster starts, it will have the required environment pre-installed.
- Custom container images: You can create a custom Docker container image with the required libraries and tools pre-installed. You can then use this container image as the base image for your Databricks clusters. This way, every time a cluster starts, it will use the custom container image with the required packages pre-installed.
You can choose the approach that best fits your needs and preferences.
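For the custom container option, the image is referenced in the cluster specification when the cluster is created. The following is a minimal sketch, assuming the workspace has Databricks Container Services enabled and that the cluster is created with a JSON payload containing a `docker_image` field, as I understand the Clusters API. The image URL, cluster name, runtime version, and node type are hypothetical placeholders, and `build_cluster_spec` is an illustrative helper, not part of any Databricks library. Please check the field names against the current API documentation before relying on them.

```python
import json


def build_cluster_spec(image_url, num_workers=1):
    """Build a cluster spec dict that points at a custom Docker image.

    All concrete values below are placeholders (assumptions); replace them
    with values that are valid in your own workspace.
    """
    return {
        "cluster_name": "vault-preinstalled",   # hypothetical name
        "spark_version": "12.2.x-scala2.12",    # placeholder runtime version
        "node_type_id": "i3.xlarge",            # placeholder node type
        "num_workers": num_workers,
        # Image with vault-cli and the other libraries already baked in,
        # so nothing needs to be installed when the cluster starts.
        "docker_image": {"url": image_url},
    }


if __name__ == "__main__":
    spec = build_cluster_spec("my-registry.example.com/databricks-vault:1.0")
    # Print the payload that would be sent when creating the cluster.
    print(json.dumps(spec, indent=2))
```

With this approach the installation cost is paid once, when the image is built, rather than on every cluster start as with an init script.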
04-28-2023 10:00 AM
I currently use the first option (init scripts), but my goal is to avoid installing everything each time a cluster starts. I want a cluster with the libraries already installed in the environment. It seems that the second and third options would provide that. Is there any documentation for them, especially the second option? I couldn't find anything about it.

