<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Deploy python application with submodules - Poetry library management in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/deploy-python-application-with-submodules-poetry-library/m-p/65818#M32934</link>
    <description>Deploy python application with submodules - Poetry library management in Data Engineering</description>
    <pubDate>Mon, 08 Apr 2024 15:54:31 GMT</pubDate>
    <dc:creator>57410</dc:creator>
    <dc:date>2024-04-08T15:54:31Z</dc:date>
    <item>
      <title>Deploy python application with submodules - Poetry library management</title>
      <link>https://community.databricks.com/t5/data-engineering/deploy-python-application-with-submodules-poetry-library/m-p/65818#M32934</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm using DBX (I'll soon move to Databricks Asset Bundles, but that doesn't change anything about my situation) to deploy a Python application to Databricks. I'm also using Poetry to manage my libraries and dependencies.&lt;/P&gt;&lt;P&gt;My project looks like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;Project A
├── Folders A
├── main.py
├── pyproject.toml
└── ~Project B
    ├── Folders B
    ├── main.py
    └── pyproject.toml&lt;/LI-CODE&gt;&lt;P&gt;Project B is a submodule with its own libraries and dependencies. To avoid duplicate imports, and to avoid managing libraries in Project A that are only used by Project B, I declare Project B as a path dependency in Project A's 'pyproject.toml' file.&lt;/P&gt;&lt;P&gt;Project A's pyproject.toml:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;[tool.poetry.dependencies]
python = "^3.10"
dbx = "^0.8.15"
project_b = { path = "./project_b", develop = true }&lt;/LI-CODE&gt;&lt;P&gt;This way, the poetry.lock file from Project A includes the libraries defined in its own pyproject.toml plus all the additional ones needed by Project B.&lt;/P&gt;&lt;P&gt;To deploy my code to Databricks, DBX builds a wheel file with the following METADATA information:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;Metadata-Version: 2.1
Name: project_a
Version: 1.0.0
Summary: some desc
Author: me
Author-email: me@me.com
Requires-Python: &amp;gt;=3.10,&amp;lt;4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: dbx (&amp;gt;=0.8.15,&amp;lt;0.9.0)
Requires-Dist: project_b @ file:///project_b&lt;/LI-CODE&gt;&lt;P&gt;We can see "dbx" and "project_b" listed, as defined in Project A's pyproject.toml file.&lt;/P&gt;&lt;P&gt;It then fails on Databricks when I try to run my job (deployed with DBX using a deployment.yml file), with the following error message:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;24/04/08 14:27:18 WARN LibraryState: [Thread 168] Failed to install library dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/dist/project_a-1.0.0-py3-none-any.whl
org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install --upgrade /local_disk0/tmp/addedFile556d8bb0a3244f44834308af3f689c807372305198373422453/project_a-1.0.0-py3-none-any.whl --disable-pip-version-check) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/project_b'&lt;/LI-CODE&gt;&lt;P&gt;My assumption is that Databricks doesn't know how to resolve the relative path recorded in the wheel's metadata into a full path. In this case, it should be "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/project_b".&lt;/P&gt;&lt;P&gt;DBX allows me to use a relative path when I deploy a job, like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;environments:
  local:
    workflows:
      - name: dbx_execute_job
        spark_python_task:
          python_file: file://main.py
          parameters:
            - '--config'
            - 'file:fuse://conf/jobs/config.yaml'&lt;/LI-CODE&gt;&lt;P&gt;Here "file://main.py" will point to "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/main.py".&lt;BR /&gt;There's also a significant difference between the path I give in the deployment.yml file (file://, with only 2 slashes) and how Poetry records it (file:///project_b, with 3 slashes).&lt;/P&gt;&lt;P&gt;I don't know whether what I'm trying to do is achievable, but in the end I would like to deploy a Python application, with a submodule in it, without listing all of Project B's libraries in Project A's pyproject.toml file.&lt;/P&gt;&lt;P&gt;I would appreciate any help!&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Mon, 08 Apr 2024 15:54:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/deploy-python-application-with-submodules-poetry-library/m-p/65818#M32934</guid>
      <dc:creator>57410</dc:creator>
      <dc:date>2024-04-08T15:54:31Z</dc:date>
    </item>
  </channel>
</rss>