Deploy python application with submodules - Poetry library management

57410 · ‎04-08-2024

Hi,

I'm using DBX (I'll soon move to Databricks Asset Bundles, but that doesn't change anything in my situation) to deploy a Python application to Databricks. I also use Poetry to manage my libraries and dependencies.

My project looks like this:

Project A
├── Folders A
├── main.py
├── pyproject.toml
└── ~Project B
    ├── Folders B
    ├── main.py
    └── pyproject.toml

Project B is a submodule with its own libraries and dependencies. To avoid declaring dependencies twice, and to avoid managing in Project A libraries that only Project B uses, I declare Project B as a path dependency in Project A's pyproject.toml file.

Project A's pyproject.toml file:

[tool.poetry.dependencies]
python = "^3.10"
dbx = "^0.8.15"
project_b = { path = "./project_b", develop = true }

By doing so, Project A's poetry.lock file includes the libraries defined in its own pyproject.toml, plus any additional ones that Project B requires.

To deploy my code to Databricks, DBX builds a wheel file with the following METADATA information:

Metadata-Version: 2.1
Name: project_a
Version: 1.0.0
Summary: some desc
Author: me
Author-email: me@me.com
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: dbx (>=0.8.15,<0.9.0)
Requires-Dist: project_b @ file:///project_b

We can see "dbx" and "project_b" listed as requirements, as defined in Project A's pyproject.toml file.

The job (deployed with DBX using a deployment.yml file) then fails on Databricks when I try to run it, with the following error message:
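To check this before deploying, the wheel's METADATA can be inspected for requirements that point at a local file. Below is a minimal sketch using only the standard library; the function names `file_requirements` and `file_requirements_in_wheel` are hypothetical helpers of mine, not part of DBX or Poetry, and the check is a simple substring match rather than a full requirement parser.

```python
import zipfile
from email.parser import Parser


def file_requirements(metadata_text):
    """Return the Requires-Dist entries that reference a file:// URL."""
    # Wheel METADATA uses an email-style "Key: value" header format.
    metadata = Parser().parsestr(metadata_text)
    requirements = metadata.get_all("Requires-Dist") or []
    return [req for req in requirements if "@ file://" in req]


def file_requirements_in_wheel(wheel):
    """Same check, reading METADATA from a wheel path or file-like object."""
    with zipfile.ZipFile(wheel) as archive:
        # A wheel is a zip archive; metadata lives in *.dist-info/METADATA.
        name = next(
            n for n in archive.namelist() if n.endswith(".dist-info/METADATA")
        )
        return file_requirements(archive.read(name).decode("utf-8"))


if __name__ == "__main__":
    sample = (
        "Metadata-Version: 2.1\n"
        "Name: project_a\n"
        "Requires-Dist: dbx (>=0.8.15,<0.9.0)\n"
        "Requires-Dist: project_b @ file:///project_b\n"
    )
    print(file_requirements(sample))  # ['project_b @ file:///project_b']
```

Any entry this returns is a path that pip will try to open on the machine doing the install, which in my case is the cluster and not my laptop.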

24/04/08 14:27:18 WARN LibraryState: [Thread 168] Failed to install library dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/dist/project_a-1.0.0-py3-none-any.whl
org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install --upgrade /local_disk0/tmp/addedFile556d8bb0a3244f44834308af3f689c807372305198373422453/project_a-1.0.0-py3-none-any.whl --disable-pip-version-check) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/project_b'

My assumption is that Databricks doesn't know how to dynamically resolve the full path of project_b relative to the deployed wheel file. In this case, it should be "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/project_b".

DBX allows me to use a relative path when I deploy a job, like this:

environments:
  local:
    workflows:
      - name: dbx_execute_job
        spark_python_task:
          python_file: file://main.py
          parameters:
            - '--config'
            - 'file:fuse://conf/jobs/config.yaml'

Here, "file://main.py" points to "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/main.py".
There's also a significant difference between the path I give in the deployment.yml file (file://, with only 2 slashes) and the one Poetry writes (file:///project_b, with 3 slashes).
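To illustrate the slash difference, here is how Python's standard URL parser splits the two forms. This only shows generic URL parsing; I'm not claiming it is how DBX or pip handle these references internally, and the helper name `describe` is just mine.

```python
from urllib.parse import urlsplit


def describe(reference):
    """Split a reference into the URL parts that matter here."""
    parts = urlsplit(reference)
    return {"scheme": parts.scheme, "netloc": parts.netloc, "path": parts.path}


# Three slashes: empty host, so the path is absolute from the filesystem root.
poetry_ref = describe("file:///project_b")
# Two slashes: what follows is parsed as the host part, and the path is empty.
dbx_ref = describe("file://main.py")

print(poetry_ref)  # {'scheme': 'file', 'netloc': '', 'path': '/project_b'}
print(dbx_ref)     # {'scheme': 'file', 'netloc': 'main.py', 'path': ''}
```

The first result matches the '/project_b' path in the error message above: read as a standard URL, the reference in the wheel is an absolute path on the cluster, not a path relative to the artifact location.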

I don't know if what I'm trying to do is achievable, but in the end I would like to be able to deploy a Python application with a submodule in it, without listing all of Project B's libraries in Project A's pyproject.toml file.

I would appreciate any help!

Thank you
