
Private PyPI repos on DBR 13+

dvmentalmadess
Valued Contributor

We use a private PyPI repo (AWS CodeArtifact) to publish custom Python libraries. We make the private repo available to DBR 12.2 clusters using an init script, as prescribed in the Databricks KB. When we tried to upgrade to 13.2, this stopped working. Specifically:

  • The cluster logs an error (see below) and fails to start when configured using the Jobs API `library.pypi.package` parameter (via Terraform)
  • Installing a cluster library using the Databricks UI also fails with the same error

The docs for Cluster Libraries note the following limitation:

On Databricks Runtime 13.1 and above, cluster Python libraries are supported on clusters that use shared access mode in a Unity Catalog-enabled workspace, including Python wheels that are uploaded as workspace files.

So I looked at creating a cluster with shared access mode, but according to the Create a cluster docs, shared access mode has its own limitation: init scripts are not supported.

Since it seems like cluster libraries won't work on DBR 13+, I looked at the documentation for workspace libraries. Unfortunately, the only way to authenticate is to embed the credentials in the index URL, and CodeArtifact authorization tokens expire after at most 12 hours, so this won't work.
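For illustration, the index URL stored with a workspace library would have to carry the short-lived token directly. Roughly something like the following, where the domain, account ID, region, and repository names are made-up placeholders:

# Hypothetical domain, account ID, region, and repository, for illustration only.
# Fetch a CodeArtifact authorization token (valid for at most 12 hours).
TOKEN=$(aws codeartifact get-authorization-token --domain my-domain --domain-owner 111122223333 --query authorizationToken --output text)

# The index URL a workspace library would have to store, with the short-lived token baked in:
# https://aws:${TOKEN}@my-domain-111122223333.d.codeartifact.us-east-1.amazonaws.com/pypi/my-repo/simple/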

I don't see a way to use a private PyPI repo for distributing our libraries - at least not using AWS CodeArtifact. Am I missing something?

Here's the error mentioned above:

Failed to attach library python-pypi;s3_ingest;;0.0.2; to Spark
org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 's3_ingest==0.0.2' --disable-pip-version-check) exited with code 1. WARNING: The directory '/home/libraries/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
ERROR: Could not find a version that satisfies the requirement s3_ingest==0.0.2 (from versions: none)
ERROR: No matching distribution found for s3_ingest==0.0.2

 


7 REPLIES

@Retired_mod thank you for your response.

This only happens with a job compute cluster, and only when I change spark_version from 12.2.x-scala2.12 to 13.2.x-scala2.12. The same job definition works fine when configured as DBR 12.2. Also, if I spin up an interactive cluster on 13.2, attach it to a notebook, and run the following, it works fine as well:

 

%pip install "s3_ingest==0.0.2"

 

Also:

 

/databricks/python3/bin/python -m pip install "s3_ingest==0.0.2"

 

However, I'm not actually executing pip from an init script. I'm just using the job API libraries[*].pypi.package attribute as shown in the definition below.

To address your resolution suggestions:

  1. Library is available (running pip from notebook using interactive compute using both DBR 12.2 and 13.2)
  2. Library version is correct (same as #1 above)
  3. Install a different version (same results w/ 0.0.1: job works w/ DBR 12.2, does not work w/ DBR 13.2)
  4. Check permissions (this is a stock DBR image, I will check this but I would assume this would work since I'm using a standard Service Principal user and I haven't made any changes to this path)
  5. pip sudo (I'm not explicitly executing pip; this is being done via whatever mechanism the job API uses)

Here's the job definition in full:

 

{
    "job_id": 696122671962585,
    "creator_user_name": "<redacted>@<redacted>.com",
    "run_as_user_name": "<redacted service principal application id>",
    "run_as_owner": true,
    "settings": {
        "name": "s3_sample_test",
        "email_notifications": {},
        "webhook_notifications": {},
        "timeout_seconds": 0,
        "schedule": {
            "quartz_cron_expression": "0 0 0 * * ?",
            "timezone_id": "America/Denver",
            "pause_status": "UNPAUSED"
        },
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "main",
                "python_wheel_task": {
                    "package_name": "s3_ingest",
                    "entry_point": "s3",
                    "parameters": [
                        "--artifact-bucket",
                        "<redacted>",
                        "--conf-file",
                        "integration-pipeline/dev/<redacted>sandbox/shared/s3_sample_test/s3_ingest.yml",
                        "--source-schema",
                        "integration-pipeline/dev/<redacted>sandbox/shared/s3_sample_test/source_schema.json"
                    ]
                },
                "new_cluster": {
                    "spark_version": "13.2.x-scala2.12",
                    "aws_attributes": {
                        "instance_profile_arn": "arn:aws:iam::<redacted aws account id>:instance-profile/<redacted role name>"
                    },
                    "instance_pool_id": "0510-171953-luck30-pool-y9abm5g0",
                    "data_security_mode": "SINGLE_USER",
                    "autoscale": {
                        "min_workers": 1,
                        "max_workers": 8
                    }
                },
                "libraries": [
                    {
                        "pypi": {
                            "package": "s3_ingest==0.0.2"
                        }
                    }
                ],
                "timeout_seconds": 0,
                "email_notifications": {}
            }
        ],
        "format": "MULTI_TASK"
    },
    "created_time": 1690576962438
}

 

 

 

 

@Retired_mod thank you for your reply. In response to your suggestions:

  1. Library available? I can run pip install from a notebook on interactive compute using both DBR 12.2 and 13.2
  2. Library version? Same results as 1.
  3. Different version? Same results as 1 and 2 using interactive compute. Also, job compute works for both library versions on DBR 12.2, but neither works on 13.2.
  4. Directory permissions? From a notebook running ls -lh /home/libraries/.cache/pip returns ls: cannot access '/home/libraries/.cache/pip': No such file or directory
  5. Pip sudo -H? I'm not explicitly running pip at all. I'm using the libraries[*].pypi.package attribute from the Create Job API.

Additional details

Here are both pip commands I tried on interactive compute using DBR 12.2 and 13.2. Both of these worked in this scenario:

%pip install "s3_ingest==0.0.2"

Also:

%sh
/databricks/python3/bin/python -m pip install "s3_ingest==0.0.2"

And here is the complete (redacted) job definition copied from workflow console UI:

{
    "job_id": 696122671962585,
    "creator_user_name": "<redacted email>",
    "run_as_user_name": "<redacted service principal application id>",
    "run_as_owner": true,
    "settings": {
        "name": "s3_sample_test",
        "email_notifications": {},
        "webhook_notifications": {},
        "timeout_seconds": 0,
        "schedule": {
            "quartz_cron_expression": "0 0 0 * * ?",
            "timezone_id": "America/Denver",
            "pause_status": "UNPAUSED"
        },
        "max_concurrent_runs": 1,
        "tasks": [
            {
                "task_key": "main",
                "python_wheel_task": {
                    "package_name": "s3_ingest",
                    "entry_point": "s3",
                    "parameters": [
                        "--artifact-bucket",
                        "<redacted s3 bucket name>",
                        "--conf-file",
                        "integration-pipeline/dev/<redacted>sandbox/shared/s3_sample_test/s3_ingest.yml",
                        "--source-schema",
                        "integration-pipeline/dev/<redacted>sandbox/shared/s3_sample_test/source_schema.json"
                    ]
                },
                "new_cluster": {
                    "spark_version": "13.2.x-scala2.12",
                    "aws_attributes": {
                        "instance_profile_arn": "arn:aws:iam::<redacted aws account id>:instance-profile/<redacted aws iam role name>"
                    },
                    "instance_pool_id": "<redacted instance pool id>",
                    "data_security_mode": "SINGLE_USER",
                    "autoscale": {
                        "min_workers": 1,
                        "max_workers": 8
                    }
                },
                "libraries": [
                    {
                        "pypi": {
                            "package": "s3_ingest==0.0.2"
                        }
                    }
                ],
                "timeout_seconds": 0,
                "email_notifications": {}
            }
        ],
        "format": "MULTI_TASK"
    },
    "created_time": 1690576962438
}

 

@Retired_mod,

I've been running some additional experiments.

Experiment 1

  1. Create DBR 12.2 cluster
  2. Once running, check to see if /home/libraries/.cache/pip exists using ls -lh (it does not)
  3. Use cluster configuration "Libraries" tab to install "s3_ingest==0.0.2" (success)
  4. Check again to see if pip cache folder exists (it does not)
  5. Use cluster configuration "Libraries" tab to uninstall "s3_ingest==0.0.2" (pending restart)

Experiment 2

  1. Edit the cluster configuration from "Experiment 1", change to DBR 13.2 then click "Confirm and Restart"
  2. Once running, check to see if /home/libraries/.cache/pip exists using ls -lh (it does not)
  3. Run /databricks/python/bin/python -m pip uninstall -y s3_ingest to verify s3_ingest is not installed (returns warning that package is not installed)
  4. Use cluster configuration "Libraries" tab to install "s3_ingest==0.0.2" (failure, same error as OP)

Experiment 3

  1. Uninstall s3_ingest==0.0.2 after "Experiment 2" (pending restart), then restart the cluster
  2. Once restart completes, run mkdir -p /home/libraries/.cache/pip then chown -R nobody:nogroup /home/libraries.
  3. Use cluster configuration "Libraries" tab to install "s3_ingest==0.0.2" (failure, same error as OP)

Conclusion

This seems like a bug to me. I'm going to look into running pip install from an init script as a workaround (see the sketch below), but the "Libraries" functionality seems broken to me, both for interactive and job compute. As I've verified, pip does not require sudo or sudo -H; it runs just fine on its own. Regardless, the "Libraries" functionality does not provide a way to specify sudo, which means I can't use the feature and will likely need to create a pip install script for every job instead of using the built-in "libraries" functionality with "pypi".
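For reference, a minimal sketch of the init-script workaround I have in mind, assuming the package name from this thread and that pip has already been pointed at our CodeArtifact repo earlier in the script:

#!/bin/bash
set -e

# Install the wheel directly with the cluster's Python, bypassing the "Libraries" feature.
# Assumes the CodeArtifact index has already been configured for pip
# (e.g., via aws codeartifact login earlier in this script).
/databricks/python3/bin/python -m pip install "s3_ingest==0.0.2" --disable-pip-version-check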

lucasvieira
New Contributor III

+1, same issue here. I'm using AWS CodeArtifact as a private PyPI repo; it worked on DBR 12.x but no longer works on DBR 13.2.

I'm using the same pattern as @dvmentalmadess: an init script that does the AWS CodeArtifact login.

lucasvieira
New Contributor III

@dvmentalmadess I found a workaround for this issue. I'm using the init script below to create the /home/libraries home folder, give ownership to the libraries user, and then run the aws codeartifact login command as that user via sudo -u libraries.

 

#!/bin/bash

# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

aws --version

echo "AWS CodeArtifact login (root)"

aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx

sudo mkdir /home/libraries
sudo chown libraries:libraries /home/libraries
sudo chmod 755 /home/libraries

sudo -u libraries aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx

 

 
In my init script, I'm running aws codeartifact login twice: once as the root user and once as the libraries user.

It worked for me.

dvmentalmadess
Valued Contributor

@lucasvieira thanks for tracking down the fix! It solves the problem as described, so I'm marking your post as the solution.

In case it helps someone else, my original posts left out a detail from our setup in order to make the problem easy to replicate. Our CodeArtifact repository does not cache public dependencies. Instead, we configure global.extra-index-url to pull public dependencies from PyPI. Here's how our script was different:

 

#!/bin/bash

# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

aws --version

echo "AWS CodeArtifact login (root)"

aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx

# here's the additional setup 
/databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple

 

As a result, I was seeing errors telling me my dependencies could not be found. To fix the issue, I had to set global.extra-index-url for the libraries user as well. Here's the full script:

 

#!/bin/bash

# Install AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

aws --version

echo "AWS CodeArtifact login (root)"

aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx

/databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple

sudo mkdir /home/libraries
sudo chown libraries:libraries /home/libraries
sudo chmod 755 /home/libraries

sudo -u libraries aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx

sudo -u libraries /databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple

 

UPDATE: Also, if you are running this as a global init script and must support runtimes older than 12.2 LTS (e.g., 11.3 LTS), make sure you wrap the commands for the libraries user in an if block that checks whether the libraries user exists. Otherwise, this script will fail on 11.3.

if id "libraries" >/dev/null 2>&1; then
  # commands go here
fi
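Putting that together with the libraries-user steps from the full script above (repository, domain, and domain owner are placeholders, as before), the guarded section would look roughly like this:

if id "libraries" >/dev/null 2>&1; then
  # Only runs on runtimes where the libraries user exists (e.g., 12.2 LTS and above)
  sudo mkdir /home/libraries
  sudo chown libraries:libraries /home/libraries
  sudo chmod 755 /home/libraries

  sudo -u libraries aws codeartifact login --tool pip --repository xxxxxx --domain xxxxxx --domain-owner xxxxxx
  sudo -u libraries /databricks/python3/bin/python -m pip config set global.extra-index-url https://pypi.org/simple
fi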

dvmentalmadess
Valued Contributor

I'm coming back to provide an updated solution that doesn't rely on the implementation detail of the user name (e.g., libraries), which is not a documented contract and could change and break in the future.

The key is to use the --global flag when calling pip config set. Unfortunately, aws codeartifact login doesn't do this: it sets global.index-url, but only in the per-user (--user) config. The workaround is to use aws codeartifact get-authorization-token instead of login and then construct the index URL manually, embedding the credentials with the token:

#!/bin/bash
set -e

echo "Authenticating to AWS Code Artifact"
CODEARTIFACT_AUTH_TOKEN=$(aws codeartifact get-authorization-token --domain <code-artifact-domain> --region <region of domain> --output text --query authorizationToken)
/databricks/python3/bin/python -m pip config --global set global.index-url https://aws:${CODEARTIFACT_AUTH_TOKEN}@<codeartifact repo url>

echo "Adding extra index to pip config"
/databricks/python3/bin/python -m pip config --global set global.extra-index-url https://pypi.org/simple
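To sanity-check that both keys landed in the system-wide config rather than a per-user file, something like this from a notebook should list them (just a verification step, not part of the init script):

%sh
# List the system-wide (global scope) pip configuration to confirm
# global.index-url and global.extra-index-url are set for all users.
/databricks/python3/bin/python -m pip config --global list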

 
