Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Libraries installation governance

noorbasha534
Contributor

Dear all

I would like to know the best practices for installing libraries on Databricks compute (all-purpose and job).

We need to screen the libraries, run vulnerability tests, and only then let them be installed through a centralized CI/CD process. However, some data scientists feel that this approach doesn't scale to meet their needs.

I would appreciate hearing any success stories around this.

Br,

Noor.

4 REPLIES

filipniziol
Contributor III

Hi @noorbasha534 ,

The best practice here is to install the libraries in an init script and configure the cluster to run that script at start-up:

1. Create an init script, for example:

 

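#!/bin/bash
set -euo pipefail  # exit non-zero if the download fails
# Download the approved JAR into the cluster's JAR directory.
wget -P /databricks/jars/ https://repo1.maven.org/maven2/com/example/library/1.0.0/library-1.0.0.jar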

 

2. Store the init script in cloud storage

3. Configure clusters to use the init script:

          - In your cluster settings, under Advanced Options > Init Scripts, specify the path to your init script.


          - This ensures every time a cluster starts, it installs the approved libraries automatically.

4. Integrate with CI/CD Pipelines:

         - Incorporate the deployment of init scripts into your CI/CD process. This way, any updates to libraries go through screening and vulnerability checks before being deployed.
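One way to make sure the cluster installs exactly the file that passed screening is to pin a checksum in the init script. The following is a minimal sketch, assuming the CI/CD screening records a SHA-256 digest for each approved JAR and that `sha256sum` and `wget` are available on the cluster image. The function names `verify_sha256` and `install_approved_jar` are hypothetical, not part of Databricks.

```shell
#!/bin/bash
# Hypothetical helpers for an init script: install a JAR only if its
# SHA-256 digest matches the one recorded during screening.

# Return 0 if the file's SHA-256 digest equals the expected value.
# $1: file path; $2: expected digest (hex)
verify_sha256() {
  local file="$1" expected="$2"
  local actual
  actual="$(sha256sum "$file" | cut -d ' ' -f 1)"
  [ "$actual" = "$expected" ]
}

# Download a JAR to a temporary file, verify it, then move it into place.
# $1: download URL; $2: expected digest; $3: target dir (default /databricks/jars)
install_approved_jar() {
  local url="$1" expected="$2" dest_dir="${3:-/databricks/jars}"
  local tmp
  tmp="$(mktemp)"
  wget -q -O "$tmp" "$url" || { rm -f "$tmp"; return 1; }
  if verify_sha256 "$tmp" "$expected"; then
    chmod 644 "$tmp"
    mv "$tmp" "$dest_dir/$(basename "$url")"
  else
    echo "checksum mismatch for $url" >&2
    rm -f "$tmp"
    return 1
  fi
}
```

In the init script this would be called as `install_approved_jar "<jar url>" "<digest recorded at screening>"`, so a changed or tampered file causes the script to exit with an error instead of being installed.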

filipniziol
Contributor III

Hi @noorbasha534 ,

In my previous reply I focused too much on the init scripts and too little on the core of your question, which is how to conduct vulnerability tests 🙂

However, the init scripts are key here:
1. The init script is kept in the repository.

2. To add a package, a user creates a feature branch and modifies the init script to include the new library.

3. The user opens a pull request into the main branch.

4. Vulnerability scans: as part of the PR process, your CI/CD pipeline automatically runs vulnerability scans on the updated init script and the new library.

5. Once the PR is merged, the updated init script is used by your Databricks clusters. When clusters start or restart, they execute the init script and install the new library.
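As a rough illustration of step 4, the PR check could begin by listing every download URL in the init script and rejecting any host that is not approved. This is only a sketch under that assumption: the function names and the idea of passing approved hosts as arguments are hypothetical, and a real pipeline would additionally hand the listed artifacts to a dedicated vulnerability scanner.

```shell
#!/bin/bash
# Hypothetical CI helper: fail the pull request if the init script
# downloads from a host that is not on the approved list.

# Print every http(s) URL found in the given script, one per line.
extract_urls() {
  grep -Eo "https?://[^[:space:]\"']+" "$1" || true
}

# Print the host part of a URL (the text between "://" and the next "/").
url_host() {
  local rest="${1#*://}"
  echo "${rest%%/*}"
}

# Return 0 if every URL in the script points at an approved host, else 1.
# $1: path to the init script; remaining arguments: approved hosts
check_init_script() {
  local script="$1"; shift
  local offenders
  offenders="$(
    extract_urls "$script" | while IFS= read -r url; do
      host="$(url_host "$url")"
      approved=no
      for candidate in "$@"; do
        [ "$host" = "$candidate" ] && approved=yes
      done
      [ "$approved" = yes ] || echo "$url"
    done
  )"
  if [ -n "$offenders" ]; then
    echo "unapproved download sources:" >&2
    echo "$offenders" >&2
    return 1
  fi
  return 0
}
```

A pipeline step could then run `check_init_script init.sh repo1.maven.org` and block the merge on a non-zero exit code.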

noorbasha534
Contributor

@filipniziol thanks again for your time. The thing is, we would like to block access to these URLs, because at times we have found developers and data scientists downloading packages that Maven had marked as vulnerable.

Hi @noorbasha534 ,

If you want to have very strict policies:
1. Create a private artifact repository, such as Azure Artifacts.

2. Configure the init scripts to use the private repository.

3. Block network access to public repositories.
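For Python packages, step 2 might look like the sketch below, which has the init script write a pip configuration file so that pip resolves packages only from the private index. This assumes pip-installed libraries rather than the Maven JAR from the earlier example. The function name `write_pip_conf` and the index URL are hypothetical placeholders; the real URL and any authentication depend on your artifact repository.

```shell
#!/bin/bash
# Hypothetical init-script helper: point pip at a private package index.

# Write a pip configuration file whose only package source is the given index.
# $1: private index URL (placeholder); $2: target file (default /etc/pip.conf)
write_pip_conf() {
  local index_url="$1"
  local target="${2:-/etc/pip.conf}"
  cat > "$target" <<EOF
[global]
index-url = ${index_url}
EOF
}
```

The init script would call, for example, `write_pip_conf "https://pkgs.example.internal/simple/"`. The configuration alone does not stop a user from passing another index on the command line, which is why step 3 (blocking network access to public repositories) is still needed.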

Still, I would go with what I have already described:
1. Create a role for data scientists.

2. Grant only the necessary cluster permissions to the role: give Can Attach To and Can Restart, but not Can Manage.

3. Follow the steps as described above:

  • The init script is kept in the repository.
  • To add a package, a user creates a feature branch and modifies the init script to include the new library.
  • The user opens a pull request into the main branch.
  • Vulnerability scans: as part of the PR process, your CI/CD pipeline automatically runs vulnerability scans on the updated init script and the new library.
  • Once the PR is merged, the updated init script is used by your Databricks clusters. When clusters start or restart, they execute the init script and install the new library.

With that setup, the data scientists will not be able to modify the cluster configuration manually. The only way for them to add a library to the cluster will be to modify the init script, which will be subject to the vulnerability scan.
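The permission setup in step 2 can also be scripted. The sketch below assumes the Databricks Permissions REST API for clusters and a workspace group named by the caller. The function names are hypothetical, and `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are assumed to be set in the environment. `CAN_RESTART` is used on the assumption that it includes the ability to attach, and the `PUT` call is assumed to replace the cluster's existing access list, so check both against the API documentation before relying on it.

```shell
#!/bin/bash
# Hypothetical helpers: give a group restart (but not manage) rights on a cluster.

# Print a JSON request body granting the group CAN_RESTART.
# $1: group name
permissions_payload() {
  local group="$1"
  cat <<EOF
{
  "access_control_list": [
    { "group_name": "${group}", "permission_level": "CAN_RESTART" }
  ]
}
EOF
}

# Send the payload to the cluster permissions endpoint.
# $1: cluster ID; $2: group name
# Assumes DATABRICKS_HOST and DATABRICKS_TOKEN are exported by the caller.
apply_cluster_permissions() {
  local cluster_id="$1" group="$2"
  curl --fail --silent --show-error -X PUT \
    -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    -H "Content-Type: application/json" \
    --data "$(permissions_payload "$group")" \
    "${DATABRICKS_HOST}/api/2.0/permissions/clusters/${cluster_id}"
}
```

Running `apply_cluster_permissions "<cluster-id>" "data-scientists"` from the same CI/CD pipeline keeps the permission change reviewable alongside the init script.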
