Tuesday
MY_DEPLOY_KEY: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-key}}'
MY_DEPLOY_TOKEN: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token}}'
MY_DEPLOY_TOKEN_USER: '{{secrets/<MY_SECRET_SCOPE>/my-deploy-token-user}}'
mkdir -p /root/.ssh/gitlab
echo "${MY_DEPLOY_KEY}" > /root/.ssh/gitlab/id_rsa # Add private key
ssh-keyscan gitlab > /root/.ssh/known_hosts # Get host keys from server
# Configure SSH to use the private key for Gitlab:
cat << EOL > /root/.ssh/config
Host gitlab
HostName <OUR_GITLAB_HOSTNAME>
GlobalKnownHostsFile=/root/.ssh/known_hosts
User git
IdentityFile /root/.ssh/gitlab/id_rsa
EOL
chmod 600 /root/.ssh/gitlab/id_rsa /root/.ssh/known_hosts /root/.ssh/config
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
error: subprocess-exited-with-error
× git clone --filter=blob:none --quiet 'ssh://****@<OUR_GITLAB_DOMAIN>/<...>.git' /tmp/pip-install-dpz6wll5/<...> did not run successfully.
│ exit code: 128
╰─> See above for output.
git config --global --add credential.helper store
cat << EOL > /root/.git-credentials
https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@<OUR_GITLAB_HOSTNAME>
EOL
chmod 600 /root/.git-credentials
fatal: could not read Username for 'https://<OUR_GITLAB_HOSTNAME>': No such device or address
error: subprocess-exited-with-error
I also performed the following checks:
Tuesday - last edited Tuesday
Between the execution of the init script and the wheel file attachment on a job cluster, there are several factors that could block access to stored Git credentials:
Environment Isolation: Job clusters are designed to be ephemeral and isolated. This means that any environment setup done in the init script might not persist or be accessible when the job runs. This isolation ensures that each job runs in a clean environment, which can lead to the loss of any temporary configurations or credentials set up during the init script execution.
Credential Storage: The credentials set up in the init script might not be stored in a way that they are accessible to the job tasks. For example, if the credentials are written to a file, the job tasks might not have the necessary permissions or paths to access these files.
Network Configuration: The network configuration on job clusters might be different from interactive clusters. This can affect the ability to verify host keys or access external repositories, leading to issues like "Host key verification failed."
Security Policies: Job clusters might have stricter security policies that prevent the use of certain credentials or access methods. This can include restrictions on SSH keys or HTTPS tokens, leading to failures in accessing private repositories.
Best Practices for Securely Accessing Private Repositories on Job Clusters:
Use Databricks Secrets: Store your Git credentials (SSH keys or HTTPS tokens) in Databricks Secrets. This ensures that the credentials are securely managed and can be accessed by the job tasks without being exposed in the init scripts.
Environment Variables: Use environment variables to pass credentials to the job tasks. This can be done by setting the environment variables in the init script and ensuring that the job tasks are configured to read these variables.
Databricks Repos: Use Databricks Repos to manage your code. Databricks Repos integrates with Git providers and handles the authentication and access management, reducing the need to manually manage credentials.
Cluster Policies: Define cluster policies that ensure the necessary configurations and credentials are set up correctly for job clusters. This can help enforce consistent and secure access to private repositories.
Package Registry: Consider using the package registry feature offered by GitLab. You can register built artifacts and create a pip.conf file on the node with the registry URL and its credentials as an extra index URL. This method can help manage dependencies more securely and efficiently.
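To narrow down which of the factors above applies, one option is to run a small check from inside the job task itself and compare it with what the init script wrote. The following is only a sketch: the function names `check_cred_file` and `report_context` are made up for illustration, and the paths in the usage comment are the ones from the original post. Git and SSH look up `~/.git-credentials` and `~/.ssh` relative to the user running the command, so a different user or home directory at install time would be one plausible explanation, though nothing in this thread confirms it.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers (names are illustrative, not a Databricks API).

# Print whether a file written by the init script is visible to this process.
# Returns 0 if the file is readable, 1 otherwise.
check_cred_file() {
  local path
  path="$1"
  if [ -r "$path" ]; then
    echo "present: $path"
  elif [ -e "$path" ]; then
    echo "exists but unreadable: $path"
    return 1
  else
    echo "missing: $path"
    return 1
  fi
}

# Print the user and home directory the current process runs with.
report_context() {
  echo "user=$(id -un) HOME=${HOME:-unset}"
}

# Example calls, to be run from the job task (paths taken from the post above):
#   report_context
#   check_cred_file /root/.ssh/config
#   check_cred_file /root/.git-credentials
```

If the files are reported as missing or the user differs from the one the init script ran as, that would point toward the isolation or credential storage explanations rather than a network issue.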
Tuesday
@Walter_C wrote:
Environment Isolation: Job clusters are designed to be ephemeral and isolated. This means that any environment setup done in the init script might not persist or be accessible when the job runs. This isolation ensures that each job runs in a clean environment, which can lead to the loss of any temporary configurations or credentials set up during the init script execution.
I think that this is the actual cause, but it would be great to get a definitive statement on this.
yesterday
Hi @andre-h ,
As a good alternative, you can build the Python package (wheel or egg) in your GitLab or GitHub workflows and upload it to a dedicated cloud storage bucket. You can then specify the cloud storage path of the library in the job's dependencies, and it will be installed on the job cluster whenever the job is triggered.
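A rough sketch of what the CI step for this could look like is below. It assumes the bucket is on AWS S3 and that the `build` package and the AWS CLI are available in the CI image; neither is stated in this thread, and the function name `publish_wheel` and the destination prefix are placeholders.

```shell
#!/bin/sh
# Hypothetical CI step: build a wheel and copy it to a storage bucket.
# Assumptions: AWS S3 as the storage backend, `python -m build` and `aws` on PATH
# (some images only provide `python3`).

publish_wheel() {
  local dest whl
  dest="$1"   # destination prefix with trailing slash, e.g. s3://<bucket>/wheels/

  # Build only the wheel into ./dist from the project in the current directory.
  python -m build --wheel --outdir dist . || return 1

  # Upload every wheel that was produced.
  for whl in dist/*.whl; do
    aws s3 cp "$whl" "$dest" || return 1
  done
}

# Example call from a CI job (bucket name is a placeholder):
#   publish_wheel "s3://<bucket>/wheels/"
```

The uploaded wheel path can then be referenced in the job's dependent libraries, so no Git credentials are needed on the cluster at install time.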
14 hours ago
Hi @hari-prasad ,
Thanks, that sounds like a very good solution as well.
I managed to get it to run by using the GitLab package registry as a private PyPI index and creating a pip.conf file with the credentials for HTTP access in the initialization script. As I wrote, I wouldn't have expected it to work, but apparently this is the only way to make your custom library an integral part of your dependencies.
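For anyone landing here later, a minimal sketch of such an init-script step follows. It reuses `MY_DEPLOY_TOKEN_USER` and `MY_DEPLOY_TOKEN` from the original post; `GITLAB_HOST`, `GITLAB_PROJECT_ID`, the function name `write_pip_conf`, and the target path are assumptions for illustration. The URL follows GitLab's project-level PyPI registry layout, and it assumes the token values contain no characters that would need URL encoding.

```shell
#!/bin/sh
# Hypothetical init-script step: point pip at a private GitLab PyPI registry.
# GITLAB_HOST and GITLAB_PROJECT_ID are assumed variables, not from the thread.

write_pip_conf() {
  local target index_url
  target="$1"   # path of the pip config file, e.g. /etc/pip.conf

  # Fail early if any required value is missing.
  : "${MY_DEPLOY_TOKEN_USER:?deploy token user not set}"
  : "${MY_DEPLOY_TOKEN:?deploy token not set}"
  : "${GITLAB_HOST:?GitLab host not set}"
  : "${GITLAB_PROJECT_ID:?GitLab project ID not set}"

  index_url="https://${MY_DEPLOY_TOKEN_USER}:${MY_DEPLOY_TOKEN}@${GITLAB_HOST}/api/v4/projects/${GITLAB_PROJECT_ID}/packages/pypi/simple"

  # Write the registry as an extra index so public packages still come from PyPI.
  mkdir -p "$(dirname "$target")"
  printf '[global]\nextra-index-url = %s\n' "$index_url" > "$target"

  # The file contains credentials, so restrict it to the owner.
  chmod 600 "$target"
}

# Example call in the init script:
#   write_pip_conf /etc/pip.conf
```

With this in place the private library can be listed by name and version like any other dependency, and pip resolves it from the registry during installation.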