Is it possible to have Cluster with pre-installed dependencies?

joao_albuquerqu
New Contributor II

I run some jobs in the Databricks environment where some resources require authentication. I do this (and need to) through the vault-cli in an init script.

However, the init script has to install vault-cli and other libraries every time it runs. Is there any way to have them pre-installed somehow? I would like to avoid this installation every time I run a job.


Anonymous
Not applicable

@João Victor Albuquerque:

Yes, there are a few ways to pre-install libraries and tools in the Databricks environment:

  1. Cluster-scoped init scripts: You can specify a shell script that runs every time a cluster is created or restarted. The script can install libraries and tools with package managers such as pip or apt-get, so the required packages are in place before any job code runs.
  2. Databricks environments: You can create an environment that bundles the required libraries and tools. An environment is a versioned set of libraries, and you can specify which one to use when creating or starting a cluster, so the cluster comes up with that set pre-installed.
  3. Custom container images: You can build a Docker image with the required libraries and tools baked in and use it as the base image for your Databricks clusters. Every cluster that starts from that image already has the packages installed.

You can choose the approach that best fits your needs and preferences.
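As an illustration of the first option, a cluster-scoped init script that pre-installs vault-cli might look like the sketch below. The version pin, download URL, and install path are assumptions, not values from this thread; check the current release on releases.hashicorp.com before using it.

```shell
#!/bin/bash
# Hypothetical cluster-scoped init script (sketch): pre-installs the
# HashiCorp vault CLI and a Python client on every cluster start.
# VAULT_VERSION and the install path are assumptions to adjust.
set -euo pipefail

VAULT_VERSION="1.15.2"

# Download and unpack the vault binary onto the PATH.
curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
  -o /tmp/vault.zip
unzip -o /tmp/vault.zip -d /usr/local/bin
chmod +x /usr/local/bin/vault
rm /tmp/vault.zip

# Any extra Python libraries the jobs need can be installed here too.
pip install hvac
```

Note that an init script like this still runs on every cluster start; it only saves you from repeating the commands in each job. Baking the tools into an image (option 3) is what avoids the install step entirely.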

I currently use the first option (init scripts). But my goal is to avoid installing everything each time a cluster starts; I want a cluster that already has the libraries in its environment. It seems the 2nd and 3rd options would provide that. Is there any documentation for them, especially the second option? I didn't find anything about it.
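For the third option, Databricks Container Services lets a cluster start from a custom Docker image. A minimal sketch, assuming a Databricks runtime base image and the same hypothetical vault-cli version as above (both the base image tag and the version are assumptions to verify against the Databricks docs):

```dockerfile
# Hypothetical Dockerfile sketch for Databricks Container Services.
# The base image tag and VAULT_VERSION are assumptions.
FROM databricksruntime/standard:latest

ENV VAULT_VERSION=1.15.2

# Install download tools, then bake the vault CLI into the image.
RUN apt-get update && apt-get install -y curl unzip \
 && curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
      -o /tmp/vault.zip \
 && unzip -o /tmp/vault.zip -d /usr/local/bin \
 && rm /tmp/vault.zip

# Pre-install Python libraries into the image as well; the exact
# Python environment path may differ by runtime version.
RUN pip install hvac
```

Databricks documents this feature as Databricks Container Services; once it is enabled for the workspace, you reference the image URL in the cluster creation settings, and clusters built from it start with everything already installed.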
