Set up a compute policy to allow installing Python libraries from a private package index
10-10-2024 04:34 AM - edited 10-10-2024 04:38 AM
In our organization, we maintain a number of shared Python libraries. They're hosted on a private Python package index, which requires a token to allow downloads. My idea was to store the token as a secret and load it into a cluster's environment variables via a compute policy. The secret itself has permissive read access, and since I'm also a workspace admin, I'd expect to be able to read it if that's at all possible.
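For reference, the scope's read access was granted with the Databricks CLI roughly like this (the principal users is illustrative here, not our actual group):
databricks secrets put-acl global users READ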
The relevant part of my policy definition looks like this:
[...],
"spark_env_vars.PIP_INDEX_URL": {
    "type": "fixed",
    "value": "https://arneCorpPyPI:{{secrets/global/arneCorpPyPI_token}}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
},
[...]
If I run
databricks secrets get-secret global arneCorpPyPI_token
from my command line, I can see its value.
If I run
PIP_INDEX_URL="https://corpPyPI:$(databricks secrets get-secret qa-prediction auxpypi_token | jq -r .value)@gitlab.office.corp.com/api/v4/groups/42/-/packages/pypi/simple" pip install arne-corp-library
it will install the requested library correctly from the private index.
When I start a cluster with this policy, though, and open a shell, I get this:
$ echo $PIP_INDEX_URL
https://arneCorpPyPI:{{secrets/global/arneCorpPyPI_token}}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple
I thought that my user should have the required permissions, and from the secrets documentation I assumed that the secret-reference syntax I used should work in this kind of policy definition (my test cluster ran Databricks Runtime 15.4), but apparently it doesn't.
I'd like to avoid using init scripts.
What can I do?
10-11-2024 04:29 AM
I figured it out: it seems secrets can only be loaded into environment variables if the value is the secret reference and nothing else:
"value": "{{secrets/global/arneCorpPyPI_token}}" # this will work
"value": "foo {{secrets/global/arneCorpPyPI_token}} bar" # this will not
My remaining problem is that I need string interpolation to build the actual value, e.g.:
[...],
"spark_env_vars.TOKEN": {
    "type": "fixed",
    "value": "{{secrets/global/arneCorpPyPI_token}}"
},
"spark_env_vars.PIP_INDEX_URL": {
    "type": "fixed",
    "value": "https://arneCorpPyPI:${TOKEN}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
},
[...]
and JSON maps are unordered. As it happens, PIP_INDEX_URL is initialized before TOKEN, so my auth is broken. I tried a couple of other names, and a variable named TEMPORARY is consistently initialized before PIP_INDEX_URL, which makes it work. Obviously, this is not something I want to rely on in any way, shape or form. Is there a better approach? I assume I'm not the first one to define env vars in a policy that depend on each other.
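One workaround I'm considering (an untested sketch; the key name arneCorpPyPI_index_url is made up for illustration) is to store the fully assembled index URL as a single secret, so the policy only ever references one whole-value secret and nothing depends on initialization order:
databricks secrets put-secret global arneCorpPyPI_index_url \
  --string-value "https://arneCorpPyPI:<token>@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
and in the policy:
"spark_env_vars.PIP_INDEX_URL": {
    "type": "fixed",
    "value": "{{secrets/global/arneCorpPyPI_index_url}}"
}
The downside is that rotating the token means rewriting the whole URL secret, but it avoids the interpolation entirely.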
02-23-2025 04:19 AM - edited 02-23-2025 04:23 AM
Hello @arne_c,
I'm working on creating a Python package that I will host on Azure DevOps. The idea is to download the package when creating different Jobs, and the way you solved the problem is exactly what I intend to use.
From what I've seen among the proposed approaches, using compute policies seems to be the best practice for this. I wanted to ask how you resolved the issue: did you keep the TEMPORARY declaration, or did you end up approaching it differently?
I’d like to mention that I intend to use Databricks Asset Bundles for the creation of these Jobs.
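For the Asset Bundles part, this is roughly how I plan to attach such a policy to a job cluster in databricks.yml (a sketch only; the policy_id, node type, and runtime version are placeholders, not values from this thread):
resources:
  jobs:
    package_consumer_job:
      name: package_consumer_job
      job_clusters:
        - job_cluster_key: main
          new_cluster:
            # Hypothetical ID of the compute policy that injects PIP_INDEX_URL
            policy_id: "0123456789ABCDEF"
            apply_policy_default_values: true
            spark_version: "15.4.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            num_workers: 1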
Thanks in advance, and best regards.

