Is it possible to have Cluster with pre-installed dependencies?

joao_albuquerqu
New Contributor II

I run some jobs in the Databricks environment where some resources require authentication. I do this (and need to) through the vault-cli in an init script.

However, the init script has to install vault-cli and other libraries every time it runs. Is there any way to have them pre-installed somehow? I would like to avoid this installation every time I run a job.


Anonymous
Not applicable

@João Victor Albuquerque:

Yes, there are a few ways to pre-install libraries and tools in the Databricks environment:

  1. Cluster-scoped init scripts: You can specify a shell script that runs every time a cluster is created or restarted. The script can install libraries and tools with package managers such as pip or apt-get, so the required packages are in place before any job code runs.
  2. Databricks environments: You can create an environment that bundles the required libraries and tools. An environment is a versioned set of libraries, and you can specify which one to use when creating or starting a cluster, so the cluster comes up with that set pre-installed.
  3. Custom container images: You can build a Docker image with the required libraries and tools baked in and use it as the base image for your Databricks clusters. Every cluster that starts from that image already has the packages installed.

You can choose the approach that best fits your needs and preferences.
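As an illustration of the first option, a cluster-scoped init script that pre-installs vault-cli might look like the sketch below. The version pin, download URL, and install path are assumptions, not values from this thread; check the current release on releases.hashicorp.com before using it.

```shell
#!/bin/bash
# Hypothetical cluster-scoped init script (sketch): pre-installs the
# HashiCorp vault CLI and a Python client on every cluster start.
# VAULT_VERSION and the install path are assumptions to adjust.
set -euo pipefail

VAULT_VERSION="1.15.2"

# Download and unpack the vault binary onto the PATH.
curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
  -o /tmp/vault.zip
unzip -o /tmp/vault.zip -d /usr/local/bin
chmod +x /usr/local/bin/vault
rm /tmp/vault.zip

# Any extra Python libraries the jobs need can be installed here too.
pip install hvac
```

Note that an init script like this still runs on every cluster start; it only saves you from repeating the commands in each job. Baking the tools into an image (option 3) is what avoids the install step entirely.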

I currently use the first option (init scripts). But my goal is to avoid installing everything each time a cluster starts; I want a cluster that already has the libraries in its environment. It seems the 2nd and 3rd options would provide that. Is there any documentation for them, especially the second option? I didn't find anything about it.
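For the third option, Databricks Container Services lets a cluster start from a custom Docker image. A minimal sketch, assuming a Databricks runtime base image and the same hypothetical vault-cli version as above (both the base image tag and the version are assumptions to verify against the Databricks docs):

```dockerfile
# Hypothetical Dockerfile sketch for Databricks Container Services.
# The base image tag and VAULT_VERSION are assumptions.
FROM databricksruntime/standard:latest

ENV VAULT_VERSION=1.15.2

# Install download tools, then bake the vault CLI into the image.
RUN apt-get update && apt-get install -y curl unzip \
 && curl -fsSL "https://releases.hashicorp.com/vault/${VAULT_VERSION}/vault_${VAULT_VERSION}_linux_amd64.zip" \
      -o /tmp/vault.zip \
 && unzip -o /tmp/vault.zip -d /usr/local/bin \
 && rm /tmp/vault.zip

# Pre-install Python libraries into the image as well; the exact
# Python environment path may differ by runtime version.
RUN pip install hvac
```

Databricks documents this feature as Databricks Container Services; once it is enabled for the workspace, you reference the image URL in the cluster creation settings, and clusters built from it start with everything already installed.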
