Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

How to install private repository as package dependency in Databricks Workflow

Junda
New Contributor III

I am a member of the development team at our company, and we use Databricks as our ETL tool. We use the Git integration for our code and run Workflows on a daily basis. Recently, we created another internal private Git repository, and we want to install and use its packages automatically in our programs. I know we can install from a private repository by specifying the repository URL with authentication information in requirements.txt, as follows:

  • git+https://<your-username>:<your-token>@github.com/<your-username>/<your-private-repo>.git@main#egg=<package-name>
  • git+ssh://git@github.com/<your-username>/<your-private-repo>.git@main#egg=<package-name>

However, these methods tie authentication to an individual account, so I suppose they're not suitable for a large project.

I found there is a token called a "Deploy Token" which is not tied to an individual account but is instead linked to a specific repository. However, it apparently still requires storing a secret key in the cluster every time the Workflow runs.

Is there any way to install and use packages from a private repository in another program, or are there any Databricks ideas/features I'm missing?

1 REPLY

mark_ott
Databricks Employee

You can install and use private repository packages in Databricks workflows in a scalable and secure way, but there are trade-offs and best practices to consider for robust, team-friendly automation. Here's a direct answer and a breakdown of solutions and strategies.

It is possible to automate the installation of packages from an internal private repository without tying authentication to a single individual. The most robust, enterprise-friendly options include using Databricks Secrets, Workspace Libraries, and centralized tokens with automation, rather than storing credentials in code or per-user files.


Secure Package Installation Strategies

1. Databricks Secrets for Credential Management

  • Store deployment tokens (including GitHub Deploy Tokens, Service Account tokens, or SSH keys) securely in Databricks Secrets.

  • Reference these secrets in your Workflow or notebook so that the credentials are injected programmatically at runtime, not hardcoded or tied to any single developer's account.

  • This minimizes risk and centralizes management: if a token needs rotation, you update it once in the secret scope.
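As a sketch of this pattern (scope, key, and repo names below are assumptions, not from the original post), a notebook or job task can fetch the token at runtime and build the authenticated pip URL; `x-access-token` is the username GitHub accepts for token-based HTTPS authentication:

```python
import subprocess
import sys

def install_private_package(token: str, org: str, repo: str,
                            ref: str = "main", dry_run: bool = False) -> str:
    """Build an authenticated pip URL for a private GitHub repo and install it.

    The token is injected at runtime, so it never appears in committed code
    or in per-user config files.
    """
    url = f"git+https://x-access-token:{token}@github.com/{org}/{repo}.git@{ref}"
    if not dry_run:
        subprocess.check_call([sys.executable, "-m", "pip", "install", url])
    return url

# On Databricks, the token would come from a secret scope (names hypothetical):
# token = dbutils.secrets.get(scope="pkg-tokens", key="gh-deploy-token")
# install_private_package(token, "my-org", "my-private-repo")
```

Because `dbutils.secrets.get` redacts secret values in notebook output, the token also stays out of logs.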

2. Workspace Libraries & Automated Jobs

  • Build your internal library as a wheel (.whl) in your CI/CD pipeline on the private repo (eggs are deprecated; prefer wheels).

  • Upload the built artifact to a centralized, access-controlled cloud location (like DBFS, Azure Blob, S3), then attach/install it to your Databricks clusters/workflows.

  • This makes package management more predictable, avoids git pulls at runtime, and works well with dependency resolution.
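For example, a Jobs API task definition can attach the CI-built wheel directly (the DBFS path and package name below are hypothetical); the cluster installs it at start-up, with no git access needed at runtime:

```json
{
  "task_key": "daily_etl",
  "libraries": [
    { "whl": "dbfs:/FileStore/wheels/internal_pkg-1.2.0-py3-none-any.whl" }
  ]
}
```

The same artifact can also be installed interactively with `%pip install /dbfs/FileStore/wheels/internal_pkg-1.2.0-py3-none-any.whl` for development.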

3. Use of Personal Access Tokens (PAT) vs Deploy Tokens

  • Deploy Tokens are preferable to personal tokens because they're bound to a repository or project, not an individual. Store them in Secrets; do not hardcode them.

  • Consider using machine user accounts with limited permissions for deploy automation.

  • For SSH, store the private SSH key in Secrets and use the key when running pip install commands referencing private repos.
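A minimal init-script sketch of the SSH approach, assuming the key is stored in a secret and surfaced to the cluster as an environment variable (all scope, key, and path names here are hypothetical) via an entry like `SSH_PRIVATE_KEY={{secrets/pkg-scope/gh-deploy-key}}` in the cluster's environment variables:

```shell
#!/bin/bash
# Hypothetical cluster-scoped init script: materialize an SSH deploy key
# from an environment variable populated out of a Databricks secret scope.
set -e
KEY_DIR="${KEY_DIR:-$HOME/.ssh}"            # overridable for local dry runs
KEY_FILE="$KEY_DIR/id_ed25519_deploy"
SSH_PRIVATE_KEY="${SSH_PRIVATE_KEY:-dummy-key-for-dry-runs}"

mkdir -p "$KEY_DIR"
printf '%s\n' "$SSH_PRIVATE_KEY" > "$KEY_FILE"
chmod 600 "$KEY_FILE"                        # ssh refuses world-readable keys
ssh-keyscan -T 5 github.com >> "$KEY_DIR/known_hosts" 2>/dev/null || true

# Jobs on this cluster can then install over SSH without per-user tokens:
#   GIT_SSH_COMMAND="ssh -i $KEY_FILE" \
#     pip install "git+ssh://git@github.com/<org>/<repo>.git@main"
```

Using a dedicated key file name avoids clobbering any existing default key on the node.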

4. Databricks Repos, Git Integration, and CI/CD

  • Databricks Repos support syncing notebooks and source code directly from a git repo, but are less suited for direct pip install of Python packages.

  • Instead, trigger a CI/CD job (GitHub Actions, Azure DevOps, etc.) to build and release your package. Databricks can then automate fetch/install via artifact management.
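One way to wire this up (workflow file, tag pattern, and secret names are all assumptions) is a GitHub Actions job that builds the wheel and copies it to DBFS with the Databricks CLI:

```yaml
# .github/workflows/release.yml -- sketch, not a drop-in config
name: build-and-publish
on:
  push:
    tags: ["v*"]
jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install build databricks-cli
      - run: python -m build --wheel
      - run: databricks fs cp dist/*.whl dbfs:/FileStore/wheels/ --overwrite
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

Tag-triggered releases keep only versioned, reviewed artifacts flowing into Databricks.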

5. Managing Secrets for Workflow Runs

  • You do not need to reload secrets for every workflow run. Configure secret references (for example, in cluster environment variables or Spark config) at cluster setup time, and reference them in each job. Databricks offers APIs and UI methods to automate secret management.
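Secret provisioning itself can be scripted once by an admin. A minimal sketch against the Secrets REST API (`POST /api/2.0/secrets/put`), with host, scope, and key names as assumptions:

```python
import json
import os
import urllib.request

def put_secret_request(host: str, token: str, scope: str,
                       key: str, value: str) -> urllib.request.Request:
    """Build the REST request that writes (or rotates) a secret in a scope."""
    body = {"scope": scope, "key": key, "string_value": value}
    return urllib.request.Request(
        f"{host}/api/2.0/secrets/put",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Only send when real workspace credentials are configured (env names assumed):
if os.environ.get("DATABRICKS_HOST") and os.environ.get("DATABRICKS_TOKEN"):
    req = put_secret_request(os.environ["DATABRICKS_HOST"],
                             os.environ["DATABRICKS_TOKEN"],
                             "pkg-scope", "gh-deploy-token", "<token-value>")
    urllib.request.urlopen(req)
```

Rotation then becomes a single scripted call rather than a per-cluster change.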


Recommended Workflow Example

  1. Build your package in CI/CD (wheel, source distribution, etc.).

  2. Upload the artifact to a cloud location accessible by Databricks.

  3. Store authentication credentials in a Databricks secret scope.

  4. Use init scripts or notebook automation to install the package in the cluster, referencing the secret for authentication if pulling from a private repo.

  5. Rotate and audit secrets regularly.


Key Points & Best Practices

  • For team-wide usage across workflows, store credentials in Databricks Secret Scopes, not environment variables or code.

  • Use project-bound tokens ("Deploy Tokens") and machine/service accounts for automation.

  • Avoid tying install processes to individual users; prioritize project secrets managed by admins.

  • If your package changes frequently, automate build/deploy with CI/CD, and only install versioned artifacts in Databricks.


Missing Features or Improvements

  • Databricks lacks fully native pip integration for private repos without some credential management.

  • Marketplace or shared workspace library management (notebooks, code artifacts) is possible, but not as seamless as installing public packages.

  • Newer features (like Databricks CLI and Terraform integration) allow automation of resource and secret provisioning.


Comparison of Approaches

Solution                  Scalable  Security  Tied to Individual  Main Management Area
Personal Token in URL     No        Low       Yes                 Developer
Deploy Token in Secret    Yes       High      No                  Admin/Project
SSH Key via Secret        Yes       High      No                  Admin/Project
Upload Package to DBFS    Yes       High      No                  CI/CD & Databricks

By following these best practices, you can securely and efficiently use private repository packages within Databricks, supporting both development and production workflows without risk of credential leaks or individual account dependencies.