
Library installation governance

noorbasha534
Visitor

Dear all

I would like to know the best practices around library installation on Databricks compute (all-purpose and job clusters).

The need is to screen the libraries, conduct vulnerability tests, and then let them be installed through a centralized CI/CD process. However, at times, some data scientists feel this sort of approach doesn't scale to meet their needs.

I would appreciate hearing any success stories around this.

Br,

Noor.

4 REPLIES

filipniziol
New Contributor III

Hi @noorbasha534 ,

The best practice here is to install the libraries in an init script and configure the cluster to run the script at start-up:

1. Create an init script, for example:

#!/bin/bash
# Download the approved library JAR into the cluster's classpath directory
wget -P /databricks/jars/ https://repo1.maven.org/maven2/com/example/library/1.0.0/library-1.0.0.jar

2. Store the init script in cloud storage.

3. Configure clusters to use the init script:

- In your cluster settings, under Advanced Options > Init Scripts, specify the path to your init script.
- This ensures every time a cluster starts, it installs the approved libraries automatically.

4. Integrate with CI/CD pipelines:

- Incorporate the deployment of init scripts into your CI/CD process. This way, any updates to libraries go through screening and vulnerability checks before being deployed.
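
For illustration, a minimal sketch of such a pipeline step, assuming the init script lives in the repo as cluster-init.sh (a placeholder name) and that a scanner such as Trivy is available on the build agent; substitute whatever scanner your security team uses:

#!/bin/bash
# Hypothetical CI step: download every artifact referenced in the init
# script and scan it before the script may be deployed.
set -euo pipefail

SCAN_DIR=$(mktemp -d)

# Pull each JAR URL out of the init script (assumes one URL per wget line)
grep -oE 'https://[^ ]+\.jar' cluster-init.sh | while read -r url; do
    wget -q -P "$SCAN_DIR" "$url"
done

# Fail the pipeline if the scanner reports HIGH or CRITICAL findings
trivy fs --exit-code 1 --severity HIGH,CRITICAL "$SCAN_DIR"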

filipniziol
New Contributor III

Hi @noorbasha534 ,

My previous reply focused too much on the init scripts and less on the core of your question: how to conduct vulnerability tests 🙂

However, the init scripts are key here:
1. The init script is kept in a repository.

2. To add a package, a user creates a feature branch and modifies the init script to include the new library.

3. The user opens a pull request into the main branch.

4. Vulnerability scans: as part of the PR process, your CI/CD pipeline automatically runs vulnerability scans on the updated init script and the new library.

5. Once the PR is merged, the updated init script is used by your Databricks clusters. When clusters start or restart, they execute the init script and install the new library.
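
For the deployment in step 5, a minimal sketch using the Databricks CLI (the volume path is a placeholder; use whichever location your clusters read init scripts from):

#!/bin/bash
# Hypothetical post-merge deploy step: copy the approved init script to a
# Unity Catalog volume so clusters pick it up on their next (re)start.
databricks fs cp cluster-init.sh dbfs:/Volumes/main/governance/init_scripts/cluster-init.sh --overwrite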

noorbasha534
Visitor

@filipniziol thanks again for your time. The thing is, we would like to block access to these URLs, as at times we have found developers and data scientists downloading packages that were marked as vulnerable on Maven.

filipniziol
New Contributor III

Hi @noorbasha534 ,

If you want very strict policies:
1. Create a private artifact repository, like Azure Artifacts.

2. Configure init scripts to use the private repository (a sketch follows this list).

3. Block network access to public repositories.
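
For step 2, a minimal sketch of an init script that points pip at a private feed (the feed URL is a placeholder):

#!/bin/bash
# Hypothetical init script: make pip resolve packages only from the
# private, pre-screened feed instead of the public PyPI index.
pip config set global.index-url https://pkgs.dev.azure.com/<org>/_packaging/approved-feed/pypi/simple/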

Still, I would go with what I have already described:
1. Create a role for the data scientists.

2. Grant only the necessary cluster permissions to that role: provide Can Attach To and Can Restart, but remove the Can Manage permission (a sketch follows at the end of this reply).

3. Follow the steps as described above:

  • The init script is kept in a repository.
  • To add a package, a user creates a feature branch and modifies the init script to include the new library.
  • The user opens a pull request into the main branch.
  • Vulnerability scans: as part of the PR process, your CI/CD pipeline automatically runs vulnerability scans on the updated init script and the new library.
  • Once the PR is merged, the updated init script is used by your Databricks clusters. When clusters start or restart, they execute the init script and install the new library.

With that setup, the data scientists will not be able to manually modify cluster configurations. The only way they will be able to add a library to a cluster will be by modifying the init script, which will be subject to a vulnerability scan.
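
For reference, a minimal sketch of setting those permissions via the Permissions API (workspace URL, cluster ID, and group name are placeholders):

#!/bin/bash
# Hypothetical sketch: give the data-science group CAN_RESTART on a cluster
# (which also covers attaching). PUT replaces the existing ACL, so the
# group ends up without CAN_MANAGE.
curl -X PUT "https://<workspace-url>/api/2.0/permissions/clusters/<cluster-id>" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
        "access_control_list": [
          { "group_name": "data-scientists", "permission_level": "CAN_RESTART" }
        ]
      }'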
