Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Install Python dependency on job cluster from a privately hosted GitLab repository (HTTPS/SSH)

andre-h
New Contributor
Hello,
We intend to deploy a Databricks workflow based on a Python wheel file which needs to run on a job cluster. One of the dependencies declared in pyproject.toml is another Python project that lives in a private GitLab repository, so we need to give the cluster secure access to our GitLab domain.

We do not want to declare the dependency URL in pyproject.toml/requirements.txt in a form containing credentials. The .whl file metadata needs to be devoid of any credentials.

However, the library still needs to be declared as a dependency of the main code somehow, for proper downstream dependency management.
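For illustration, a credential-free declaration of such a dependency could look roughly like this in pyproject.toml (the package name, group path and tag are placeholders, and git+ssh is just one possible URL form):

[project]
name = "my-main-project"
version = "0.1.0"
dependencies = [
    # Direct reference to the private repository, without any credentials:
    "my-internal-lib @ git+ssh://git@<OUR_GITLAB_HOSTNAME>/<group>/my-internal-lib.git@v1.2.3",
]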
 
There are two repository-specific ways to do this in an automated fashion: either via a deploy key, which is a project-specific SSH key, or via a deploy token for HTTPS access. Both would work, provided we can make these credentials usable on the nodes. Unfortunately, this is proving difficult.
 
This is how I have proceeded so far to try out both ways (still not particularly secure, since it works with plain-text secrets, but these are first steps):
 

Step 1

I created a deploy key pair and added the public key to the GitLab project. I also created a deploy token.
 

Step 2

I created a secret scope containing the secrets (the private key of the deploy key pair and the deploy token username/token pair) and configured the job cluster with the following environment variables so that they can be used during initialization:

MY_DEPLOY_KEY: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-key}}'
MY_DEPLOY_TOKEN: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token}}'
MY_DEPLOY_TOKEN_USER: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token-user}}'
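For context, these variables sit in the job cluster definition together with the init script reference, roughly like this (Jobs API 2.1 style; other required cluster fields such as spark_version and node type are omitted, and the paths are placeholders; the {{secrets/...}} references are resolved by Databricks at cluster start):

"job_clusters": [
  {
    "job_cluster_key": "main",
    "new_cluster": {
      "spark_env_vars": {
        "MY_DEPLOY_KEY": "{{secrets/<MY_SECRET_SCOPE>/my-deploy-key}}",
        "MY_DEPLOY_TOKEN": "{{secrets/<MY_SECRET_SCOPE>/my-deploy-token}}",
        "MY_DEPLOY_TOKEN_USER": "{{secrets/<MY_SECRET_SCOPE>/my-deploy-token-user}}"
      },
      "init_scripts": [
        { "workspace": { "destination": "/<path-to>/gitlab-auth-init.sh" } }
      ]
    }
  }
]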

Step 3a (SSH)

The job cluster init script configures SSH to use the MY_DEPLOY_KEY variable for our GitLab host:

mkdir -p /root/.ssh/gitlab
echo "${MY_DEPLOY_KEY}" > /root/.ssh/gitlab/id_rsa  # Add private key
ssh-keyscan <OUR_GITLAB_HOSTNAME> > /root/.ssh/known_hosts  # Get host keys from server

# Configure SSH to use the private key for GitLab:
cat << EOL > /root/.ssh/config
Host gitlab
  HostName <OUR_GITLAB_HOSTNAME>
  GlobalKnownHostsFile=/root/.ssh/known_hosts
  User git
  IdentityFile /root/.ssh/gitlab/id_rsa
EOL
chmod 600 /root/.ssh/gitlab/id_rsa /root/.ssh/known_hosts /root/.ssh/config

When I run this in a cell on an interactive cluster, it works and I can access the repository; it also works on my local computer. On the job cluster, however, the dependency installation fails because the host cannot be verified. This is the log4j output:

Host key verification failed.
  fatal: Could not read from remote repository.

  Please make sure you have the correct access rights
  and the repository exists.
  error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet 'ssh://****@<OUR_GITLAB_DOMAIN>/<...>.git' /tmp/pip-install-dpz6wll5/<...> did not run successfully.
  │ exit code: 128
  ╰─> See above for output.
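One variant I have not tried yet: if the root cause is that the library installation does not run with HOME=/root (and therefore never sees /root/.ssh/config), the SSH settings could be made user-independent via Git's system-wide configuration. Only a sketch; the /databricks/gitlab-auth path is an arbitrary choice:

# Store the key and host keys in a location that does not depend on $HOME:
mkdir -p /databricks/gitlab-auth
echo "${MY_DEPLOY_KEY}" > /databricks/gitlab-auth/id_rsa
ssh-keyscan <OUR_GITLAB_HOSTNAME> > /databricks/gitlab-auth/known_hosts
chmod 600 /databricks/gitlab-auth/id_rsa

# Point Git (system-wide, i.e. /etc/gitconfig) at that key and known_hosts file,
# so the settings apply no matter which user runs the pip/git subprocess:
git config --system core.sshCommand "ssh -i /databricks/gitlab-auth/id_rsa -o UserKnownHostsFile=/databricks/gitlab-auth/known_hosts -o IdentitiesOnly=yes"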

Step 3b (HTTPS)

The job cluster init script configures Git to use the MY_DEPLOY_TOKEN_USER and MY_DEPLOY_TOKEN variables for GitLab:

# Store the deploy-token credentials for Git's plain-text credential helper:
git config --global --add credential.helper store
cat << EOL > /root/.git-credentials
https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>
EOL
chmod 600 /root/.git-credentials

Again, this works when executed on an interactive cluster, but on a job cluster, the library installation fails. Log4j output:
fatal: could not read Username for 'https://<OUR_GITLAB_HOSTNAME>': No such device or address
  error: subprocess-exited-with-error
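Analogously for HTTPS, if the problem is that /root/.gitconfig and /root/.git-credentials are not visible to whichever user performs the installation, the credential store could be registered system-wide with an explicit file path. Again only a sketch; the path is an arbitrary choice:

# Write the deploy-token credentials to a fixed, user-independent path:
mkdir -p /databricks/gitlab-auth
cat << EOL > /databricks/gitlab-auth/git-credentials
https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>
EOL
chmod 600 /databricks/gitlab-auth/git-credentials

# Register the store helper in the system-wide Git config (/etc/gitconfig):
git config --system credential.helper "store --file=/databricks/gitlab-auth/git-credentials"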

Step 4 (further checks)

I also performed the following checks:

  • The files created by the init scripts do not get deleted after the script finishes executing, so the environment does not get "refreshed" between init script execution and wheel file attachment to the job.
  • Installing the libraries directly from inside the init scripts works when setting up the credentials in these two ways. However, this is not how we want to do it.
  • Installing the libraries when the main code is already running also works (e.g. running pip install ... in a subprocess).
A different method I thought of but have not tested yet is the package registry feature offered by GitLab. A built artifact can be registered there, and one could create a pip.conf file on the node that adds the registry URL and its credentials as an extra index URL (a sketch follows below). However, I doubt that this works, because I suspect the same commands are executed under the hood.
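The pip.conf part could look roughly like this in the init script (<PROJECT_ID> is a placeholder for the GitLab project that hosts the package; the endpoint format is the one GitLab documents for its PyPI package registry):

# Add the GitLab package registry as an additional pip index,
# reusing the deploy-token environment variables:
cat << EOL > /etc/pip.conf
[global]
extra-index-url = https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>/api/v4/projects/<PROJECT_ID>/packages/pypi/simple
EOL
chmod 600 /etc/pip.conf

The dependency in pyproject.toml would then be a plain package name instead of a Git URL.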
 
Questions:
What is happening between init script execution and wheel file attachment which could block the access to stored Git credentials on a job cluster?
What are best practices for securely accessing private repositories on job clusters?
2 REPLIES

Walter_C
Databricks Employee

Between the execution of the init script and the wheel file attachment on a job cluster, there are several factors that could block access to stored Git credentials:

  1. Environment Isolation: Job clusters are designed to be ephemeral and isolated. This means that any environment setup done in the init script might not persist or be accessible when the job runs. This isolation ensures that each job runs in a clean environment, which can lead to the loss of any temporary configurations or credentials set up during the init script execution.

  2. Credential Storage: The credentials set up in the init script might not be stored in a way that they are accessible to the job tasks. For example, if the credentials are written to a file, the job tasks might not have the necessary permissions or paths to access these files.

  3. Network Configuration: The network configuration on job clusters might be different from interactive clusters. This can affect the ability to verify host keys or access external repositories, leading to issues like "Host key verification failed."

  4. Security Policies: Job clusters might have stricter security policies that prevent the use of certain credentials or access methods. This can include restrictions on SSH keys or HTTPS tokens, leading to failures in accessing private repositories.

Best Practices for Securely Accessing Private Repositories on Job Clusters:

  1. Use Databricks Secrets: Store your Git credentials (SSH keys or HTTPS tokens) in Databricks Secrets. This ensures that the credentials are securely managed and can be accessed by the job tasks without being exposed in the init scripts.

  2. Environment Variables: Use environment variables to pass credentials to the job tasks. This can be done by setting the environment variables in the init script and ensuring that the job tasks are configured to read these variables.

  3. Databricks Repos: Use Databricks Repos to manage your code. Databricks Repos integrates with Git providers and handles the authentication and access management, reducing the need to manually manage credentials.

  4. Cluster Policies: Define cluster policies that ensure the necessary configurations and credentials are set up correctly for job clusters. This can help enforce consistent and secure access to private repositories (a sketch follows this list).

  5. Package Registry: Consider using a package registry feature offered by GitLab. You can register built artifacts and create a pip.conf file on the node with the registry URL and its credentials as an extra URL. This method can help manage dependencies more securely and efficiently.
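For point 4, a cluster policy could pin the init script and the secret-backed environment variables so that every job cluster picks them up automatically. A rough sketch of a policy definition (paths and names are placeholders):

{
  "init_scripts.0.workspace.destination": {
    "type": "fixed",
    "value": "/<path-to>/gitlab-auth-init.sh"
  },
  "spark_env_vars.MY_DEPLOY_KEY": {
    "type": "fixed",
    "value": "{{secrets/<MY_SECRET_SCOPE>/my-deploy-key}}"
  },
  "spark_env_vars.MY_DEPLOY_TOKEN": {
    "type": "fixed",
    "value": "{{secrets/<MY_SECRET_SCOPE>/my-deploy-token}}"
  },
  "spark_env_vars.MY_DEPLOY_TOKEN_USER": {
    "type": "fixed",
    "value": "{{secrets/<MY_SECRET_SCOPE>/my-deploy-token-user}}"
  }
}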


@Walter_C wrote:
  1. Environment Isolation: Job clusters are designed to be ephemeral and isolated. This means that any environment setup done in the init script might not persist or be accessible when the job runs. This isolation ensures that each job runs in a clean environment, which can lead to the loss of any temporary configurations or credentials set up during the init script execution.


I think that this is the actual cause, but it would be great to get a definitive statement on this.

 
