01-16-2024 05:17 AM
My project's setup.py file:

from setuptools import find_packages, setup

from my_project import __version__  # assumes __version__ is defined in my_project/__init__.py

PACKAGE_REQUIREMENTS = ["pyyaml", "confluent-kafka", "fastavro", "python-dotenv", "boto3", "pyxlsb", "aiohttp", "myprivatepackage"]
LOCAL_REQUIREMENTS = ["delta-spark", "scikit-learn", "pandas", "mlflow", "databricks-sql-connector", "kafka-python"]
TEST_REQUIREMENTS = ["pytest", "coverage[toml]", "pytest-cov", "dbx>=0.7,<0.8"]

setup(
    name="my_project",
    packages=find_packages(exclude=["tests", "tests.*"]),
    setup_requires=["setuptools", "wheel"],
    install_requires=PACKAGE_REQUIREMENTS,
    extras_require={"local": LOCAL_REQUIREMENTS, "test": TEST_REQUIREMENTS},
    entry_points={
        "console_scripts": [
            "etl = my_project.tasks.sample_etl_task:entrypoint"
        ]
    },
    version=__version__,
    description="My project",
    author="me",
)
I am using dbx to deploy, so here is what my deployment.yaml looks like:
environments:
  dev:
    workflows:
      - name: "mytask"
        tasks:
          - task_key: "mytask"
            new_cluster:
              spark_version: "14.2.x-scala2.12"
              node_type_id: "r5d.large"
              data_security_mode: "SINGLE_USER"
              spark_conf:
                spark.databricks.delta.preview.enabled: 'true'
                spark.databricks.cluster.profile: 'singleNode'
                spark.master: 'local[*, 4]'
              runtime_engine: STANDARD
              num_workers: 0
            spark_python_task:
              python_file: "file://my_project/entity/mytask/tasks/mytask.py"
Then I run the following command to deploy:
dbx deploy --deployment-file ./conf/dev/deployment.yml -e dev
It deploys fine. No errors!
But when I run the job, I get the following error:
24/01/16 12:13:43 INFO SharedDriverContext: Failed to attach library dbfs:/Shared/dbx/projects/[REDACTED]/abc/artifacts/dist/[REDACTED]-0.8.0-py3-none-any.whl to Spark
java.lang.Throwable: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install --upgrade /local_disk0/tmp/abc/[REDACTED]-0.8.0-py3-none-any.whl --disable-pip-version-check) exited with code 1.
Processing /local_disk0/tmp/abc/[REDACTED]-0.8.0-py3-none-any.whl
ERROR: Could not find a version that satisfies the requirement myprivatepackage (from [REDACTED]) (from versions: none)
ERROR: No matching distribution found for myprivatepackage
How do I resolve this?
07-03-2024 08:14 AM - edited 07-03-2024 08:33 AM
I added an init script to the compute that writes the private package registry credentials into /etc/pip.conf.
Something as follows:
resource "databricks_workspace_file" "gitlab_pypi_init_script" {
provider = databricks.workspace
content_base64 = base64encode(<<-EOT
#!/bin/bash
if [[ $PYPI_TOKEN ]]; then
use $PYPI_TOKEN
fi
echo $PYPI_TOKEN
printf "[global]\n" > /etc/pip.conf
printf "extra-index-url =\n" >> /etc/pip.conf
printf "\thttps://__token__:$PYPI_TOKEN@gitlab.com/api/v4/projects/12345678/packages/pypi/simple\n" >> /etc/pip.conf
EOT
)
path = "/FileStore/gitlab_pypi_init_script.sh"
}
I added this file to the Workspace and then referenced it under the init scripts of the cluster compute; with that in place, the private package installs in the cluster when it starts.
I also made sure the GitLab token was accessible as the PYPI_TOKEN variable via the cluster's Spark environment variables.
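For reference, here is a minimal sketch of how those two pieces might be wired into the new_cluster spec of the deployment.yaml shown above. The secret scope and key names (gitlab / pypi-token) are assumptions; the token should come from a Databricks secret rather than being hard-coded:

new_cluster:
  # ...existing cluster settings...
  init_scripts:
    - workspace:
        destination: "/FileStore/gitlab_pypi_init_script.sh"
  spark_env_vars:
    # Assumed secret scope/key; Databricks resolves the reference at cluster start.
    PYPI_TOKEN: "{{secrets/gitlab/pypi-token}}"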
01-19-2024 09:38 PM
Hi, does this look like a dependency error? Are all the dependencies packed in the whl? Also, could you please confirm that all the limitations are satisfied? Refer: https://docs.databricks.com/en/compute/access-mode-limitations.html