08-01-2023 02:37 PM
We use a private PyPI repo (AWS CodeArtifact) to publish custom Python libraries. We make the private repo available to DBR 12.2 clusters using an init script, as prescribed here in the Databricks KB. When we tried to upgrade to DBR 13.2, this stopped working. Specifically:
The docs for Cluster Libraries note the following limitation:
On Databricks Runtime 13.1 and above, cluster Python libraries are supported on clusters that use shared access mode in a Unity Catalog-enabled workspace, including Python wheels that are uploaded as workspace files.
So I looked at creating a cluster using shared access mode, but according to the Create a cluster docs, shared access mode has the following limitation: Init scripts are not supported.
Since it seems like cluster libraries won't work on DBR 13+, I looked at the documentation for workspace libraries. Unfortunately, the only way to authenticate is to store the credentials as part of the index URL. The problem here is that CodeArtifact limits authentication tokens to 12 hours, so this won't work.
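For reference, here is roughly what that credential-embedded index URL looks like for CodeArtifact (a sketch; <my-domain>, <account-id>, <region>, and <my-repo> are placeholders). The token baked into the URL is the part that expires within 12 hours:
#!/bin/bash
# Fetch a short-lived CodeArtifact token and embed it in a pip index URL.
# <my-domain>, <account-id>, <region>, and <my-repo> are placeholders.
TOKEN=$(aws codeartifact get-authorization-token --domain <my-domain> --domain-owner <account-id> --query authorizationToken --output text)
# Tokens are valid for at most 12 hours, so a URL stored statically in a
# workspace library configuration goes stale within a day.
INDEX_URL="https://aws:${TOKEN}@<my-domain>-<account-id>.d.codeartifact.<region>.amazonaws.com/pypi/<my-repo>/simple/"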
I don't see a way to use a private PyPI repo for distributing our libraries - at least not using AWS CodeArtifact. Am I missing something?
Here's the error mentioned above:
Failed to attach library python-pypi;s3_ingest;;0.0.2; to Spark
org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 's3_ingest==0.0.2' --disable-pip-version-check) exited with code 1. WARNING: The directory '/home/libraries/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
ERROR: Could not find a version that satisfies the requirement s3_ingest==0.0.2 (from versions: none)
ERROR: No matching distribution found for s3_ingest==0.0.2
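For what it's worth, the failure can be approximately reproduced from a %sh cell by re-running the underlying pip command from the log above as the libraries user:
%sh
# Approximate reproduction of the failing install, using the pip path and
# arguments taken from the error log above (assumes the 'libraries' user exists)
sudo -u libraries /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 's3_ingest==0.0.2' --disable-pip-version-check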
08-02-2023 09:45 AM - edited 08-02-2023 09:47 AM
@Retired_mod thank you for your response.
This only happens with a job compute cluster and only if I change spark_version from 12.2.x-scala2.12 to 13.2.x-scala2.12. The same job definition works fine when configured as DBR 12.2. Also, if I spin up an interactive cluster on 13.2, attach it to a notebook, and run the following, it works fine as well:
%pip install "s3_ingest==0.0.2"
Also:
/databricks/python3/bin/python -m pip install "s3_ingest==0.0.2"
However, I'm not actually executing pip from an init script. I'm just using the job API libraries[*].pypi.package attribute as shown in the definition below.
To address your resolution suggestions:
Here's the job definition in full:
{
  "job_id": 696122671962585,
  "creator_user_name": "<redacted>@<redacted>.com",
  "run_as_user_name": "<redacted service principal application id>",
  "run_as_owner": true,
  "settings": {
    "name": "s3_sample_test",
    "email_notifications": {},
    "webhook_notifications": {},
    "timeout_seconds": 0,
    "schedule": {
      "quartz_cron_expression": "0 0 0 * * ?",
      "timezone_id": "America/Denver",
      "pause_status": "UNPAUSED"
    },
    "max_concurrent_runs": 1,
    "tasks": [
      {
        "task_key": "main",
        "python_wheel_task": {
          "package_name": "s3_ingest",
          "entry_point": "s3",
          "parameters": [
            "--artifact-bucket",
            "<redacted>",
            "--conf-file",
            "integration-pipeline/dev/<redacted>sandbox/shared/s3_sample_test/s3_ingest.yml",
            "--source-schema",
            "integration-pipeline/dev/<redacted>sandbox/shared/s3_sample_test/source_schema.json"
          ]
        },
        "new_cluster": {
          "spark_version": "13.2.x-scala2.12",
          "aws_attributes": {
            "instance_profile_arn": "arn:aws:iam::<redacted aws account id>:instance-profile/<redacted role name>"
          },
          "instance_pool_id": "0510-171953-luck30-pool-y9abm5g0",
          "data_security_mode": "SINGLE_USER",
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "libraries": [
          {
            "pypi": {
              "package": "s3_ingest==0.0.2"
            }
          }
        ],
        "timeout_seconds": 0,
        "email_notifications": {}
      }
    ],
    "format": "MULTI_TASK"
  },
  "created_time": 1690576962438
}
08-02-2023 03:09 PM
I've been running some additional experiments.
This seems like a bug to me. I'm going to look into running pip install from an init script as a workaround, but the "Libraries" functionality seems broken to me - both for interactive and job compute. As I've verified, pip does not require sudo or sudo -H; it runs just fine on its own. Regardless, the "Libraries" functionality does not provide a way to specify sudo, which means I can't use the feature, and I likely need to create a pip install script for every job instead of using the built-in "libraries" functionality with "pypi".
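A per-job init script along those lines might look roughly like this (a sketch; <my-repo>, <my-domain>, and <account-id> are placeholders, and it assumes the cluster's instance profile has CodeArtifact permissions):
#!/bin/bash
# Sketch of the per-job workaround: authenticate to CodeArtifact and install
# the wheel directly, bypassing the Libraries feature entirely.
aws codeartifact login --tool pip --repository <my-repo> --domain <my-domain> --domain-owner <account-id>
/databricks/python3/bin/python -m pip install "s3_ingest==0.0.2" --disable-pip-version-check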
08-04-2023 09:01 AM
+1, same issue here. I'm using AWS CodeArtifact as a private PyPI repo; it worked on DBR 12.x and it's not working anymore on DBR 13.2.
I'm using the same pattern as @dvmentalmadess, an init script to do the AWS CodeArtifact login.
08-07-2023 05:17 AM - edited 08-07-2023 05:18 AM
@dvmentalmadess I found a workaround for this issue. I'm using this init script to create the /home/libraries home folder, give ownership to the libraries user, and then run the sudo -u libraries aws codeartifact login command.
#!/bin/bash
# Install AWS CLI (aarch64 build; use the x86_64 build on Intel instance types)
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version
echo "AWS CodeArtifact login (root)"
aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
# DBR 13+ installs cluster libraries as the 'libraries' user, which has no home
# directory by default. Create it, then log in as that user so its pip config
# points at CodeArtifact.
sudo mkdir /home/libraries
sudo chown libraries:libraries /home/libraries
sudo chmod 755 /home/libraries
sudo -u libraries aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
It worked for me.
08-07-2023 10:13 AM - edited 08-07-2023 11:55 AM
@lucasvieira thanks for tracking down the fix! It solves the problem as described, so I'm marking your post as the solution.
In case it helps someone else, my original posts left out a detail from our setup in order to make the problem easy to replicate. Our CodeArtifact repo does not cache public dependencies; instead, we configure global.extra-index-url to pull public dependencies from pypi.org. Here's how our script was different:
#!/bin/bash
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version
echo "AWS CodeArtifact login (root)"
aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
# here's the additional setup
/databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple
As a result I was seeing errors telling me my dependencies could not be found. To fix the issue, I had to set global.extra-index-url for the libraries user as well. Here's the full script:
#!/bin/bash
# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws --version
echo "AWS CodeArtifact login (root)"
aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
/databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple
sudo mkdir /home/libraries
sudo chown libraries:libraries /home/libraries
sudo chmod 755 /home/libraries
sudo -u libraries aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
sudo -u libraries /databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple
UPDATE: Also, if you are running this as a global init script and must support runtime versions older than 12.2 LTS (e.g., 11.3 LTS), make sure you wrap the commands for the libraries user in an if block that checks whether the libraries user exists. Otherwise, this script will fail on 11.3.
if id "libraries" >/dev/null 2>&1; then
# commands go here
fi
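Applied to the full script above, the guarded portion would look like this:
if id "libraries" >/dev/null 2>&1; then
  # Only runs on runtimes where the 'libraries' user exists (12.2 LTS and above)
  sudo mkdir /home/libraries
  sudo chown libraries:libraries /home/libraries
  sudo chmod 755 /home/libraries
  sudo -u libraries aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
  sudo -u libraries /databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple
fi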
04-26-2024 09:58 AM
I'm coming back to provide an updated solution that doesn't rely on the implementation detail of the user name (e.g., libraries), which is not a documented contract and could change and break things in the future.
The key is to use the --global flag when calling pip config set. Unfortunately, aws codeartifact login doesn't do this: it sets the global.index-url key, but only in the per-user (--user) config scope. The workaround is to use aws codeartifact get-authorization-token instead of login, then construct the index URL manually, embedding the token as the credentials:
#!/bin/bash
set -e
echo "Authenticating to AWS CodeArtifact"
# Fetch a short-lived token directly instead of using 'aws codeartifact login',
# which only writes to the per-user pip config.
CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token --domain <code-artifact-domain> --region <region of domain> --output text --query authorizationToken)
# Write the index URL to the system-wide (--global) pip config so every user,
# including the one that installs cluster libraries, picks it up.
/databricks/python3/bin/python -m pip config --global set global.index-url https://aws:${CODEARTIFACT_AUTH_TOKEN}@<codeartifact repo url>
echo "Adding extra index to pip config"
/databricks/python3/bin/python -m pip config --global set global.extra-index-url https://pypi.org/simple
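To sanity-check the result on a running cluster, you can list the system-wide pip config (a quick verification step, not part of the original script):
# Expect global.index-url and global.extra-index-url in the output
/databricks/python3/bin/python -m pip config --global list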