Install Python dependency on job cluster from a privately hosted GitLab repository (HTTPS/SSH)

Administration & Architecture

Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Hello,
We intend to deploy a Databricks workflow based on a Python wheel file which needs to run on a job cluster. There is a dependency declared in pyproject.toml which is another Python project living in a private Gitlab repository. We therefore need to provide secure access to our Gitlab domain to the cluster.

We do not want to declare the dependency URL in pyproject.toml/requirements.txt in a form containing credentials. The .whl file metadata needs to be devoid of any credentials.

However, the library still needs to be declared as a dependency of the main code somehow, for proper downstream dependency management.

There are two repository-specific ways to do this in an automated fashion: Either via a deploy key, which is a project-specific SSH key, or via a deploy token for HTTPS access. Both ways would work, provided we can get these credentials to be usable on the nodes. Unfortunately, it is being difficult.

This is how I have proceeded so far to try out both ways (still not super secure, as working with plain-text secrets, but these are first steps):

Step 1

I created a deploy key pair and added the public key to the Gitlab project. I also created a deploy token.

Step 2

I created a secret scope containing the secrets (private key of deploy key pair and deploy token username/token pair) and configured the job cluster to contain the following environment variables, so they can be used in initialization:

MY_DEPLOY_KEY: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-key}}'
MY_DEPLOY_TOKEN: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token}}'
MY_DEPLOY_TOKEN_USER: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token-user}}'

Step 3a (SSH)

The job cluster init script configures SSH to use the MY_DEPLOY_KEY variable for our Gitlab:

mkdir -p /root/.ssh/gitlab
echo "${MY_DEPLOY_KEY}" > /root/.ssh/gitlab/id_rsa  # Add private key
ssh-keyscan gitlab > /root/.ssh/known_hosts  # Get host keys from server

# Configure SSH to use the private key for Gitlab:
cat << EOL > /root/.ssh/config
Host gitlab
  HostName <OUR_GITLAB_HOSTNAME>
  GlobalKnownHostsFile=/root/.ssh/known_hosts
  User git
  IdentityFile /root/.ssh/gitlab/id_rsa
EOL
chmod 600 /root/.ssh/gitlab/id_rsa /root/.ssh/known_hosts /root/.ssh/config

When running this in a cell on an interactive cluster, this works, I can access the repository. It also works on my local computer. The job cluster however fails to install the dependency from the repository because the host cannot be verified. This is the log4j output:

Host key verification failed.
  fatal: Could not read from remote repository.

  Please make sure you have the correct access rights
  and the repository exists.
  error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet 'ssh://****@<OUR_GITLAB_DOMAIN>/<...>.git' /tmp/pip-install-dpz6wll5/<...> did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

Step 3b (HTTPS)

The job cluster init script configures Git to use the MY_DEPLOY_TOKEN_USER and MY_DEPLOY_TOKEN variables for Gitlab:

git config --global --add credential.helper store
cat << EOL > /root/.git-credentials
https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>
EOL
chmod 600 /root/.git-credentials

Again, this works when executed on an interactive cluster, but on a job cluster, the library installation fails. Log4j output:

fatal: could not read Username for 'https://<OUR_GITLAB_HOSTNAME>': No such device or address
  error: subprocess-exited-with-error

Step 4 (further checks)

I also performed the following checks:

The files created by the init scripts do not get deleted after the script finishes executing, so the environment does not get "refreshed" between init script execution and wheel file attachment to the job.
Installing the libraries directly from inside the init scripts works when setting up the credentials in these two ways. However, this is not how we want to do it.
Installing the libraries when the main code is already running also works (e.g. running subprocess(pip install ...)).

A different method I thought of but did not test yet is the package registry feature offered by Gitlab. A built artifact can be registered there and one could create a pip.conf file on the node with the registry URL and its credentials as an extra URL. However, I doubt that this works, because under the hood, I guess the same commands are executed.

Question:

What is happening between init script execution and wheel file attachment which could block the access to stored Git credentials on a job cluster?

What are best practices for securely accessing private repositories on job clusters?

1 ACCEPTED SOLUTION

Accepted Solutions

Hi @hari-prasad ,

Thanks, that sounds like a very good solution as well 🙂

I managed to get it to run by using GitLab package registry as a private pypi and creating a pip.conf file with the credentials for http access in the initialization script. As I wrote, I wouldn't have expected it to work, but apparently this is the only way to make your custom library an integral part of your dependencies.

View solution in original post

4 REPLIES 4

Between the execution of the init script and the wheel file attachment on a job cluster, there are several factors that could block access to stored Git credentials:

Environment Isolation: Job clusters are designed to be ephemeral and isolated. This means that any environment setup done in the init script might not persist or be accessible when the job runs. This isolation ensures that each job runs in a clean environment, which can lead to the loss of any temporary configurations or credentials set up during the init script execution.
Credential Storage: The credentials set up in the init script might not be stored in a way that they are accessible to the job tasks. For example, if the credentials are written to a file, the job tasks might not have the necessary permissions or paths to access these files.
Network Configuration: The network configuration on job clusters might be different from interactive clusters. This can affect the ability to verify host keys or access external repositories, leading to issues like "Host key verification failed."
Security Policies: Job clusters might have stricter security policies that prevent the use of certain credentials or access methods. This can include restrictions on SSH keys or HTTPS tokens, leading to failures in accessing private repositories.

Best Practices for Securely Accessing Private Repositories on Job Clusters:

Use Databricks Secrets: Store your Git credentials (SSH keys or HTTPS tokens) in Databricks Secrets. This ensures that the credentials are securely managed and can be accessed by the job tasks without being exposed in the init scripts.
Environment Variables: Use environment variables to pass credentials to the job tasks. This can be done by setting the environment variables in the init script and ensuring that the job tasks are configured to read these variables.
Databricks Repos: Use Databricks Repos to manage your code. Databricks Repos integrates with Git providers and handles the authentication and access management, reducing the need to manually manage credentials.
Cluster Policies: Define cluster policies that ensure the necessary configurations and credentials are set up correctly for job clusters. This can help enforce consistent and secure access to private repositories.
Package Registry: Consider using a package registry feature offered by GitLab. You can register built artifacts and create a pip.conf file on the node with the registry URL and its credentials as an extra URL. This method can help manage dependencies more securely and efficiently.

@Walter_C wrote:
Environment Isolation: Job clusters are designed to be ephemeral and isolated. This means that any environment setup done in the init script might not persist or be accessible when the job runs. This isolation ensures that each job runs in a clean environment, which can lead to the loss of any temporary configurations or credentials set up during the init script execution.

I think that this is the actual cause, but it would be great to get a deterministic statement regarding this.

Hi @andre-h ,

As a good alternative you can build the python package (wheel or egg) in your gitlab or github workflows and upload the package to dedicated cloud storage bucket. Then followed by you can specify the cloud storage path of your python library in Job dependencies which will be dynamically installed in your Job cluster when job triggered.

Regards,
Hari Prasad

Hi @hari-prasad ,

Thanks, that sounds like a very good solution as well 🙂

Photos

Upload Upload
URL URL
Saved Photos Saved Photos

Upload location

Upload location

Add Photos to Album:

New Album

Drag here to start uploading

Drag photos here or

Tap for upload options

You must install or upgrade to the latest version of Adobe Flash Player before you can upload images.