Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Databricks and AWS CodeArtifact

axelboursin
New Contributor II

Hello, I've seen several topics about this, but I still need an explanation and a solution.

In our context, developers build internal Python projects, such as X.
In Databricks, we have a cluster with a library for the main project A, which depends on X.

Our pyproject.toml looks like this:
[tool.poetry.dependencies]
X = { source = "codeartifact", version = "0.1.0" }

[[tool.poetry.source]]
name = "codeartifact"
url = "https://domain-ownerid.d.codeartifact.region-name.amazonaws.com/pypi/repo/simple/"
priority = "supplemental"

However, the cluster only searches the public PyPI repositories, so it cannot resolve X from CodeArtifact.

Thanks for your answers and your help!

3 REPLIES

axelboursin
New Contributor II

I saw that the solution may be an init script, but it's not easy to work with.

The bash script generates no logs, which makes debugging difficult, so this is not an easy way to solve my problem. Do you have any advice about it?

stbjelcevic
Databricks Employee

Hi @axelboursin ,

I think this article will help you out: https://docs.databricks.com/aws/en/admin/workspace-settings/default-python-packages (option 1 below).

Recommended approaches (choose based on your environment):

  • For broad, consistent behavior across clusters and notebooks, configure workspace-level default Python package repositories to point to CodeArtifact; this avoids per-notebook tokens and works for both serverless and classic compute.

  • (The thing you mentioned) For classic clusters only, add a cluster-scoped init script that runs aws codeartifact login and writes pip config, so pip automatically resolves from your CodeArtifact repo at cluster start.

  • For one-off installs or testing, use notebook-scoped %pip with --index-url (and --extra-index-url as needed) and credentials pulled from Databricks Secrets inside the notebook.
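For the third option, the authenticated index URL CodeArtifact expects has a fixed shape. A minimal sketch of building it (all domain/owner/region/repository values below are placeholders for illustration; in a real notebook the token should come from Databricks Secrets, never be hardcoded):

```shell
#!/bin/bash
# Build the authenticated CodeArtifact index URL that pip expects.
# CodeArtifact uses "aws" as the username and a short-lived token as the password.
codeartifact_index_url() {
  local domain="$1" owner="$2" region="$3" repo="$4" token="$5"
  echo "https://aws:${token}@${domain}-${owner}.d.codeartifact.${region}.amazonaws.com/pypi/${repo}/simple/"
}

# Placeholder values -- substitute your own domain, owner ID, region, and repo.
# In a notebook you would then run something like:
#   %pip install X --index-url "<that url>" --extra-index-url https://pypi.org/simple/
codeartifact_index_url mydomain 123456789012 eu-west-1 repo "example-token"
```

Because the token is embedded in the URL as a password, avoid printing the URL in notebook output or logs.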

theclipse
New Contributor II

Hi @stbjelcevic,
actually we are still facing a mystery.
We created an init script to generate a CodeArtifact token and change the pip config. Here is its content:

#!/bin/bash
# DOMAIN, OWNER, REGION, and REPOSITORY are expected to be set as cluster
# environment variables.
set -e

pip install awscli

# Fetch a short-lived CodeArtifact authorization token.
AWS_CODEARTIFACT_TOKEN=$(aws codeartifact get-authorization-token \
  --domain "$DOMAIN" \
  --domain-owner "$OWNER" \
  --region "$REGION" \
  --query authorizationToken \
  --output text)
export AWS_CODEARTIFACT_TOKEN

# Point pip at the CodeArtifact repo, keeping public PyPI as a fallback.
PIP_INDEX_URL="https://aws:${AWS_CODEARTIFACT_TOKEN}@${DOMAIN}-${OWNER}.d.codeartifact.${REGION}.amazonaws.com/pypi/${REPOSITORY}/simple/"
pip config set global.index-url "$PIP_INDEX_URL"
pip config set global.extra-index-url https://pypi.org/simple/

When we install a package from a notebook (with `!pip install ${name_of_package_in_codeartifact}`), it works.
When a job installs the wheel directly, it does not work.
When we try to install CodeArtifact libraries from the cluster UI (Libraries tab), it does not work either.

I don't understand it: it looks like those paths use a different pip, or that the init script's changes to the pip config are not taking effect.

Would you have any idea?
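One plausible cause worth checking: `pip config set` without a `--global` flag writes to the invoking user's config file, so pip runs triggered by jobs or the Libraries tab, which may execute as a different user or in a different environment, would never read it. A sketch that writes the system-wide `/etc/pip.conf` instead, under the same assumed DOMAIN/OWNER/REGION/REPOSITORY variables as the init script above (the example call below uses `/tmp` and placeholder values so it is runnable anywhere):

```shell
#!/bin/bash
# Write pip's [global] index settings to a config file at a chosen path.
# Targeting /etc/pip.conf makes the settings visible to every pip invocation
# on the machine, regardless of which user or process runs pip.
write_pip_conf() {
  local target="$1" domain="$2" owner="$3" region="$4" repo="$5" token="$6"
  cat > "$target" <<EOF
[global]
index-url = https://aws:${token}@${domain}-${owner}.d.codeartifact.${region}.amazonaws.com/pypi/${repo}/simple/
extra-index-url = https://pypi.org/simple/
EOF
}

# In the real init script the target would be /etc/pip.conf (cluster init
# scripts run as root, so they can write it). Placeholder values here:
write_pip_conf /tmp/pip.conf mydomain 123456789012 eu-west-1 repo "example-token"
```

This is only a sketch of one hypothesis, not a confirmed fix; comparing `pip config list` output from a notebook, a job, and the init script itself would show which config file each path actually reads.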