Set up compute policy to allow installing python libraries from a private package index

arne_c
New Contributor

In our organization, we maintain a number of libraries that we use to share code. They're hosted on a private Python package index, which requires a token for downloads. My idea was to store the token as a secret, which a policy would then load into a cluster's environment variables. The secret itself has permissive read access, and I am also a workspace admin, so I'd expect to be able to read it if that is possible at all.

The relevant part in my policy definition looks like this:

[...],
"spark_env_vars.PIP_INDEX_URL": {
  "type": "fixed",
  "value": "https://arneCorpPyPI:{{secrets/global/arneCorpPyPI_token}}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
},
[...]

If I run

databricks secrets get-secret global arneCorpPyPI_token

from my command line, I can see its value.

If I run

PIP_INDEX_URL="https://arneCorpPyPI:$(databricks secrets get-secret global arneCorpPyPI_token | jq -r .value)@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple" pip install arne-corp-library

it will install the requested library correctly from the private index.

When I start a cluster with this policy though and start a shell, I get this:

$ echo $PIP_INDEX_URL
https://arneCorpPyPI:{{secrets/global/arneCorpPyPI_token}}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple

I thought that my user should have the required permissions, and from the secrets docs I assumed that the secret-reference syntax I used should work in this kind of policy definition (my test cluster ran Databricks Runtime 15.4), but apparently it doesn't: the placeholder is passed through literally instead of being replaced by the secret value.

I'd like to avoid using init scripts.

What can I do?

 

1 REPLY

arne_c
New Contributor

I figured it out. It seems that a secret can only be loaded into an environment variable if the secret reference is the entire value and nothing else:

"value": "{{secrets/global/arneCorpPyPI_token}}"         # this will work
"value": "foo {{secrets/global/arneCorpPyPI_token}} bar" # this will not
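For anyone who wants to check their policy values against this rule before starting a cluster, here is a minimal shell sketch. The rule it encodes is only what I observed above, not documented behavior I can point to, and the function name is_bare_secret_ref is made up for this example.

```shell
# Hypothetical helper: succeeds only if the whole value is exactly one
# {{secrets/<scope>/<key>}} reference, with nothing before or after it.
# This encodes the behavior observed in this thread, not a documented rule.
is_bare_secret_ref() {
  _ref_prefix='{{secrets/'
  _ref_suffix='}}'
  case "$1" in
    "$_ref_prefix"*"$_ref_suffix") ;;   # starts and ends like a reference
    *) return 1 ;;
  esac
  # Strip the wrapper and inspect what is left: it must be "<scope>/<key>".
  _ref_inner=${1#"$_ref_prefix"}
  _ref_inner=${_ref_inner%"$_ref_suffix"}
  case "$_ref_inner" in
    *'{'*|*'}'*|*' '*) return 1 ;;      # extra braces or spaces: not a bare reference
    */*/*) return 1 ;;                  # more than one slash
    ?*/?*) return 0 ;;                  # non-empty scope and key
    *) return 1 ;;
  esac
}

for value in \
  '{{secrets/global/arneCorpPyPI_token}}' \
  'foo {{secrets/global/arneCorpPyPI_token}} bar'
do
  if is_bare_secret_ref "$value"; then
    echo "bare reference:     $value"
  else
    echo "embedded or other:  $value"
  fi
done
```

The first value is reported as a bare reference and the second is not, which matches the two cases above.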

My last problem is now that I need to use string interpolation to create my actual value, e.g.:

[...],
"spark_env_vars.TOKEN": {
  "type": "fixed",
  "value": "{{secrets/global/arneCorpPyPI_token}}"
},
"spark_env_vars.PIP_INDEX_URL": {
  "type": "fixed",
  "value": "https://arneCorpPyPI:${TOKEN}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"
},
[...] 

and JSON maps are unordered. As it happens, PIP_INDEX_URL is initialized before TOKEN, so ${TOKEN} expands to nothing and my auth is broken. I tried a couple of other names, and it looks like a variable named TEMPORARY is consistently initialized before PIP_INDEX_URL, which makes it work. Obviously, this is not something I want to rely on in any way, shape, or form. Is there a better approach? I assume I'm not the first one to define env vars in a policy that depend on each other.
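The ordering problem is easy to reproduce in a plain shell. This is only a sketch: it assumes the cluster applies the variables like sequential shell assignments with the reference expanded at assignment time, which I have not confirmed. URL_BEFORE, URL_AFTER, and dummy-token are placeholders made up for the demo.

```shell
# Assumption: env vars are applied one after another, and ${TOKEN} is
# expanded at the moment PIP_INDEX_URL is assigned.
unset TOKEN

# Wrong order: TOKEN does not exist yet, so the password part stays empty.
# (":-" only keeps the sketch from aborting if the shell runs with "set -u".)
URL_BEFORE="https://arneCorpPyPI:${TOKEN:-}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"

# Placeholder value, not a real secret.
TOKEN="dummy-token"

# Right order: TOKEN is already set when the URL is built.
URL_AFTER="https://arneCorpPyPI:${TOKEN:-}@gitlab.office.arneCorp.com/api/v4/groups/42/-/packages/pypi/simple"

echo "TOKEN assigned after the URL:  $URL_BEFORE"
echo "TOKEN assigned before the URL: $URL_AFTER"
```

Only the second URL carries the token, which is the same effect I see when TOKEN happens to be initialized after PIP_INDEX_URL.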
