Hello,
We intend to deploy a Databricks workflow based on a Python wheel file which needs to run on a job cluster. One dependency declared in pyproject.toml is another Python project living in a private Gitlab repository. We therefore need to give the job cluster secure access to our Gitlab domain.
We do not want to declare the dependency URL in pyproject.toml/requirements.txt in a form containing credentials. The .whl file metadata needs to be devoid of any credentials.
However, the library still needs to be declared as a dependency of the main code somehow, for proper downstream dependency management.
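For illustration, this is roughly what a credential-free declaration looks like; the same requirement string can go into the dependencies list of pyproject.toml (the package, group, and tag names below are placeholders, not our actual project):
# Hypothetical example: the dependency referenced via SSH with no credentials
# embedded in the URL; authentication is left entirely to the node's SSH setup.
pip install "our-private-lib @ git+ssh://git@<OUR_GITLAB_HOSTNAME>/<group>/our-private-lib.git@v1.0.0"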
There are two repository-specific ways to do this in an automated fashion: either via a deploy key, which is a project-specific SSH key, or via a deploy token for HTTPS access. Both would work, provided we can make these credentials usable on the nodes. Unfortunately, this is proving difficult.
This is how I have proceeded so far to try out both ways (still not very secure, since it works with plain-text secrets, but these are first steps):
Step 1
I created a deploy key pair and added the public key to the Gitlab project. I also created a deploy token.
Step 2
I created a secret scope containing the secrets (the private key of the deploy key pair and the deploy token username/token pair) and configured the job cluster with the following environment variables so the secrets can be used during initialization:
MY_DEPLOY_KEY: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-key}}'
MY_DEPLOY_TOKEN: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token}}'
MY_DEPLOY_TOKEN_USER: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token-user}}'
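For completeness, this is roughly how the scope and secrets can be created; a minimal sketch assuming the newer unified Databricks CLI, with key names matching the mapping above and the local file path as a placeholder:
# Create the scope and store the three secrets referenced by the env variables.
databricks secrets create-scope <MY_SECRET_SCOPE>
databricks secrets put-secret <MY_SECRET_SCOPE> my-deploy-key --string-value "$(cat ./deploy_key)"
databricks secrets put-secret <MY_SECRET_SCOPE> my-deploy-token --string-value '<TOKEN>'
databricks secrets put-secret <MY_SECRET_SCOPE> my-deploy-token-user --string-value '<TOKEN_USER>'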
Step 3a (SSH)
The job cluster init script configures SSH to use the MY_DEPLOY_KEY variable for our Gitlab:
mkdir -p /root/.ssh/gitlab
echo "${MY_DEPLOY_KEY}" > /root/.ssh/gitlab/id_rsa # Add private key
ssh-keyscan <OUR_GITLAB_HOSTNAME> > /root/.ssh/known_hosts # Get host keys from the Gitlab server
# Configure SSH to use the private key for Gitlab:
cat << EOL > /root/.ssh/config
Host gitlab
HostName <OUR_GITLAB_HOSTNAME>
GlobalKnownHostsFile=/root/.ssh/known_hosts
User git
IdentityFile /root/.ssh/gitlab/id_rsa
EOL
chmod 600 /root/.ssh/gitlab/id_rsa /root/.ssh/known_hosts /root/.ssh/config
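To narrow things down, a non-fatal connectivity check could be appended to the init script and its output inspected afterwards (a hypothetical diagnostic, not part of the current setup):
# Log whether SSH authentication to Gitlab succeeds at init time;
# BatchMode avoids any interactive prompt and the script does not fail on error.
ssh -o BatchMode=yes -T git@<OUR_GITLAB_HOSTNAME> 2>&1 \
  | tee /tmp/gitlab_ssh_check.log || true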
When I run the script above in a cell on an interactive cluster, it works and I can access the repository. It also works on my local computer. The job cluster, however, fails to install the dependency from the repository because the host cannot be verified. This is the log4j output:
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet 'ssh://****@<OUR_GITLAB_DOMAIN>/<...>.git' /tmp/pip-install-dpz6wll5/<...> did not run successfully.
│ exit code: 128
╰─> See above for output.
Step 3b (HTTPS)
The job cluster init script configures Git to use the MY_DEPLOY_TOKEN_USER and MY_DEPLOY_TOKEN variables for Gitlab:
git config --global --add credential.helper store
cat << EOL > /root/.git-credentials
https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>
EOL
chmod 600 /root/.git-credentials
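Analogously to the SSH case, a non-fatal check could be appended to this init script to confirm that the stored token works at init time (hypothetical; the group/project path is a placeholder):
# Verify that git picks up the stored HTTPS credentials; the result is only
# logged and the init script does not fail on error.
git ls-remote https://<OUR_GITLAB_HOSTNAME>/<group>/our-private-lib.git HEAD \
  | tee /tmp/gitlab_https_check.log || true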
Again, this setup works when executed on an interactive cluster, but on a job cluster the library installation fails. Log4j output:
fatal: could not read Username for 'https://<OUR_GITLAB_HOSTNAME>': No such device or address
error: subprocess-exited-with-error
Step 4 (further checks)
I also performed the following checks:
- The files created by the init scripts are not deleted after the scripts finish, so the environment does not get "refreshed" between init script execution and attachment of the wheel file to the job.
- Installing the libraries directly from inside the init scripts works with both credential setups (see the sketch after this list). However, this is not how we want to do it.
- Installing the libraries while the main code is already running also works (e.g. via subprocess.run(["pip", "install", ...])).
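For reference, the init-script install check could look roughly like this (a sketch; the package name and the pip path are assumptions for a Databricks runtime):
# Hypothetical sketch: install the private dependency directly at the end of
# the init script, using the cluster's Python environment.
/databricks/python/bin/pip install \
  "our-private-lib @ git+https://<OUR_GITLAB_HOSTNAME>/<group>/our-private-lib.git"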
A different method I thought of but have not tested yet is the package registry feature offered by Gitlab. A built artifact can be published there, and one could create a pip.conf file on the node that adds the registry URL and its credentials as an extra index URL. However, I doubt that this works, because I suspect the same commands are executed under the hood.
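If we were to try it, the node-side part would look roughly like this (a sketch; the project ID is a placeholder and the same deploy token environment variables are reused):
# Hypothetical pip.conf pointing at the project's Gitlab PyPI package registry
# as an extra index; written node-wide so any pip invocation sees it.
cat << EOL > /etc/pip.conf
[global]
extra-index-url = https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>/api/v4/projects/<PROJECT_ID>/packages/pypi/simple
EOL
chmod 600 /etc/pip.conf # Restrict to root, matching the other credential files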
Questions:
What happens between init script execution and wheel file attachment that could block access to the stored Git credentials on a job cluster?
What are best practices for securely accessing private repositories on job clusters?