Set up a compute policy to allow installing Python libraries from a private package index

arne_c
New Contributor II

In our organization, we maintain a set of libraries that we use to share code. They're hosted on a private Python package index, which requires a token for downloads. My idea was to store the token as a secret, which a policy would then load into a cluster's environment variables. The secret has permissive read access, and I am also a workspace admin, so I'd expect to be able to read it if that is possible at all.

The relevant part in my policy definition looks like this:

[...],
"spark_env_vars.PIP_INDEX_URL": {
"type": "fixed",
"value": "https://arneCorpPyPI:{{secrets/global/arneCorpPyPI_token}}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
},
[...]

If I run

databricks secrets get-secret global arneCorpPyPI_token

from my command line, I can see its value.

If I run

PIP_INDEX_URL="https://arneCorpPyPI:$(databricks secrets get-secret global arneCorpPyPI_token | jq -r .value)@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple" pip install arne-corp-library

it will install the requested library correctly from the private index.

When I start a cluster with this policy though and start a shell, I get this:

$ echo $PIP_INDEX_URL
https://arneCorpPyPI:{{secrets/global/arneCorpPyPI_token}}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple

I thought my user had the required permissions, and from the secrets documentation I assumed that the secret-reference syntax I used would work in a policy definition (my test cluster ran Databricks Runtime 15.4), but apparently it doesn't.

I'd like to avoid using init-scripts.

What can I do?

 

arne_c
New Contributor II

I figured it out: it seems that a secret is only loaded into an environment variable if the secret reference is the entire value and nothing else:

"value": "{{secrets/global/arneCorpPyPI_token}}"         # this will work
"value": "foo {{secrets/global/arneCorpPyPI_token}} bar" # this will not
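For anyone debugging the same thing, here is a minimal sketch of a check that can be run in a shell on the cluster to tell whether a variable still holds the literal reference instead of the secret. The function name has_unresolved_secret is made up for this example; it only looks for the `{{secrets/` pattern and knows nothing about Databricks itself.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any Databricks tooling):
# succeeds (exit status 0) if the given value still contains a literal
# {{secrets/...}} reference, i.e. the substitution did not happen.
has_unresolved_secret() {
  case "$1" in
    *'{{secrets/'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage in a cluster shell would be something like `has_unresolved_secret "$PIP_INDEX_URL" && echo "secret reference was not substituted"`.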

My last problem is now that I need to use string interpolation to create my actual value, e.g.:

[...],
"spark_env_vars.TOKEN": {
"type": "fixed",
"value": "{{secrets/global/arneCorpPyPI_token}}"
},
"spark_env_vars.PIP_INDEX_URL": {
"type": "fixed",
"value": "https://arneCorpPyPI:${TOKEN}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
},
[...] 

and JSON objects are unordered. As it happens, PIP_INDEX_URL is initialized before TOKEN, and my auth is broken. I tried a couple of other names, and it looks like a variable named TEMPORARY is consistently initialized before PIP_INDEX_URL, which makes it work. Obviously, I don't want to rely on this in any way, shape, or form. Is there a better approach? I assume I'm not the first one to define env vars in a policy that depend on each other.
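One possible way to sidestep the ordering question, sketched here as an assumption and not as a confirmed Databricks recommendation: keep only the standalone TOKEN variable in the policy and assemble the URL at the moment pip runs. The function names build_index_url and pip_install_private are hypothetical, and the sketch assumes the token contains no characters that would need percent-encoding in a URL.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build the index URL when pip is invoked, so nothing
# depends on the order in which the policy's env vars are initialized.
# Assumes TOKEN is set by the policy as a standalone secret reference.

# Print the private index URL for the given token; fail on an empty token.
build_index_url() {
  local token="${1:?token must not be empty}"
  printf 'https://arneCorpPyPI:%s@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple' "$token"
}

# Run pip install with PIP_INDEX_URL set only for this one command.
pip_install_private() {
  PIP_INDEX_URL="$(build_index_url "$TOKEN")" pip install "$@"
}
```

With these defined, `pip_install_private arne-corp-library` would install from the private index. This only helps where you call pip yourself (for example from a web terminal); it does not change how libraries attached to the cluster through the UI or a job definition get installed.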

Hello @arne_c,

I’m working on creating a Python package that I will host on Azure DevOps. The idea is to download the package when creating different Jobs, and the way you solved the problem is exactly what I intend to use.

From what I’ve seen among the proposed approaches, using compute policies seems to be the best practice for this. I wanted to ask how you resolved the issue: did you keep the TEMPORARY declaration, or did you end up approaching it differently?

I’d like to mention that I intend to use Databricks Asset Bundles for the creation of these Jobs.

Thanks in advance, and best regards.