Libraries installation governance
10-05-2024 12:19 PM
Dear all,
I would like to know the best practices around library installation on Databricks compute (all-purpose and job compute).
The need is to screen the libraries, run vulnerability tests, and then have them installed through a centralized CI/CD process. However, at times, some data scientists feel this sort of approach doesn't scale to meet their needs.
I would appreciate any success stories around this.
Br,
Noor.
10-05-2024 12:33 PM - edited 10-05-2024 12:39 PM
Hi @noorbasha534 ,
The best practice here is to install the libraries via an init script and configure the cluster to run the script at start-up:
1. Create an init script, for example:
#!/bin/bash
# Download the approved JAR into the cluster's library path.
wget -P /databricks/jars/ https://repo1.maven.org/maven2/com/example/library/1.0.0/library-1.0.0.jar
2. Store the init script in cloud storage.
3. Configure clusters to use the init script:
- In your cluster settings, under Advanced Options > Init Scripts, specify the path to your init script.
- This ensures every time a cluster starts, it installs the approved libraries automatically.
4. Integrate with CI/CD Pipelines:
- Incorporate the deployment of init scripts into your CI/CD process. This way, any updates to libraries go through screening and vulnerability checks before being deployed.
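The steps above can be sketched as a minimal init script. This is only an illustration: the package names, versions, and the guard around the Databricks-specific pip path are placeholder assumptions, not vetted choices.

```shell
#!/bin/bash
# Hypothetical init script: install only pre-approved, pinned libraries.
# Pinning exact versions keeps clusters in sync with what security actually scanned.
set -euo pipefail

# Approved Python packages, pinned to the versions that passed screening
# (names/versions here are placeholders for illustration).
APPROVED_PACKAGES=(
  "pandas==2.2.2"
  "requests==2.32.3"
)

for pkg in "${APPROVED_PACKAGES[@]}"; do
  if [ -d /databricks ]; then
    # On a real cluster, install with the cluster's Python environment.
    /databricks/python/bin/pip install --no-cache-dir "${pkg}"
  else
    # Outside a Databricks node (e.g. local testing), just log the intent.
    echo "Would install: ${pkg}"
  fi
done
```

Because the script lives in cloud storage and is deployed by CI/CD, updating the approved list is a single reviewed change rather than an ad-hoc cluster edit.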
10-05-2024 12:47 PM
Hi @noorbasha534 ,
In my previous reply the focus was too much on the init scripts and not enough on the core of your question: how to conduct vulnerability tests 🙂
However, the init scripts are key here:
1. The init script is kept in a repository.
2. To add a package, a user creates a feature branch and modifies the init script to include the new library.
3. The user opens a pull request into the main branch.
4. Vulnerability scans: as part of the PR process, your CI/CD pipeline automatically runs vulnerability scans on the updated init script and the new library.
5. Once the PR is merged, the updated init script is used by your Databricks clusters. When clusters start or restart, they execute the init script and install the new library.
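The PR-time scan in step 4 could look something like the sketch below. It parses the init script for pinned packages and checks them against a deny list; the deny list and the sample script content are placeholders standing in for a real scanner (e.g. pip-audit or Trivy) that your pipeline would call instead.

```shell
#!/bin/bash
# Hypothetical CI check: extract pinned packages from the init script under
# review and fail the PR if any appears on a vulnerability deny list.
set -uo pipefail

# Sample init script content; in CI this would be the file from the PR branch.
cat > init-script.sh <<'EOF'
#!/bin/bash
/databricks/python/bin/pip install pandas==2.2.2
/databricks/python/bin/pip install badlib==0.0.1
EOF

# Placeholder for real advisory data from a scanner.
DENY_LIST=("badlib==0.0.1")

status=0
# Pull out every "name==version" pin and compare against the deny list.
while read -r pkg; do
  for bad in "${DENY_LIST[@]}"; do
    if [ "$pkg" = "$bad" ]; then
      echo "BLOCKED: $pkg is on the vulnerability deny list"
      status=1
    fi
  done
done < <(grep -oE '[A-Za-z0-9_-]+==[0-9][^ ]*' init-script.sh)

echo "scan exit status: $status"
```

A nonzero exit status from this step would fail the pipeline, so the PR cannot be merged until the flagged library is removed or replaced.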
10-05-2024 12:50 PM
@filipniziol thanks again for your time. The thing is, we would like to block access to these URLs, as at times we have found developers and data scientists downloading packages that were marked as vulnerable on Maven.
10-05-2024 01:12 PM - edited 10-05-2024 01:13 PM
Hi @noorbasha534 ,
If you want very strict policies:
1. Create a private artifact repository, such as Azure Artifacts.
2. Configure the init scripts to use the private repository.
3. Block network access to public repositories.
Still, I would go with what I have already described:
1. Create a role for data scientists.
2. Grant only the necessary cluster permissions to the role: provide "Can Attach To" and "Can Restart", but remove the "Can Manage" permission.
3. Follow the steps described above:
- The init script is kept in a repository.
- To add a package, a user creates a feature branch and modifies the init script to include the new library.
- The user opens a pull request into the main branch.
- Vulnerability scans: as part of the PR process, your CI/CD pipeline automatically runs vulnerability scans on the updated init script and the new library.
- Once the PR is merged, the updated init script is used by your Databricks clusters. When clusters start or restart, they execute the init script and install the new library.
With that setup, the data scientists will not be able to manually modify cluster configurations. The only way they will be able to add a library to the cluster will be by modifying the init script, which will be subject to a vulnerability scan.
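For option 2 above (pointing clusters at the private repository), an init-script fragment could write a pip configuration so every install goes through the vetted feed. This is a sketch under assumptions: the Azure Artifacts feed URL is a placeholder, and the config directory defaults to a local path here for illustration (on a real cluster it would be /etc/pip.conf).

```shell
#!/bin/bash
# Hypothetical init-script fragment: route all pip installs through a private,
# vetted artifact feed instead of public PyPI.
set -euo pipefail

# On a real cluster this would be /etc; a local default keeps the sketch runnable.
PIP_CONF_DIR="${PIP_CONF_DIR:-./etc}"
mkdir -p "${PIP_CONF_DIR}"

# The index URL below is an example placeholder; substitute your organization's
# Azure Artifacts (or equivalent) feed.
cat > "${PIP_CONF_DIR}/pip.conf" <<'EOF'
[global]
index-url = https://pkgs.dev.azure.com/myorg/_packaging/approved-feed/pypi/simple/
EOF

echo "Wrote ${PIP_CONF_DIR}/pip.conf"
```

Combined with blocking egress to public repositories, even a manual `pip install` on the cluster can only pull packages that exist in the screened feed.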

