Hi,
I'm using DBX (I'll soon move to Databricks Asset Bundles, but that doesn't change anything in my situation) to deploy a Python application to Databricks. I'm also using Poetry to manage my libraries and dependencies.
My project looks like this:
Project A
├── Folders A
├── main.py
├── pyproject.toml
└── project_b (Project B, a Git submodule)
    ├── Folders B
    ├── main.py
    └── pyproject.toml
Project B is a Git submodule with its own libraries and dependencies. To avoid declaring dependencies twice, and to avoid managing libraries in Project A that are only used by Project B, I declare Project B as a path dependency in Project A's pyproject.toml.
Project A's pyproject.toml:
[tool.poetry.dependencies]
python = "^3.10"
dbx = "^0.8.15"
project_b = { path = "./project_b", develop = true }
By doing so, Project A's poetry.lock includes the libraries declared in its own pyproject.toml plus everything else it needs from Project B.
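For context, Project B's own pyproject.toml is a standard Poetry file along these lines (the library names below are placeholders, not the real dependencies):

[tool.poetry]
name = "project_b"
version = "1.0.0"
description = "some desc"

[tool.poetry.dependencies]
python = "^3.10"
# placeholders for Project B's own libraries
pandas = "^2.0"
requests = "^2.31"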
To deploy my code to Databricks, DBX builds a wheel file with the following METADATA:
Metadata-Version: 2.1
Name: project_a
Version: 1.0.0
Summary: some desc
Author: me
Author-email: me@me.com
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: dbx (>=0.8.15,<0.9.0)
Requires-Dist: project_b @ file:///project_b
We can see "dbx" and "project_b" listed exactly as defined in Project A's pyproject.toml.
The job (deployed with DBX using a deployment.yml file) then fails on Databricks with the following error message:
24/04/08 14:27:18 WARN LibraryState: [Thread 168] Failed to install library dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/dist/project_a-1.0.0-py3-none-any.whl
org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install --upgrade /local_disk0/tmp/addedFile556d8bb0a3244f44834308af3f689c807372305198373422453/project_a-1.0.0-py3-none-any.whl --disable-pip-version-check) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/project_b'
My assumption is that Databricks doesn't know how to resolve the submodule's path relative to where the wheel actually lives. In this case, the full path should be "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/project_b".
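In other words, for pip on the cluster to find the submodule, the wheel's metadata would presumably need to carry an absolute path that exists on the cluster, something like the line below (hypothetical, and I'm not even sure pip could consume a DBFS location this way):

Requires-Dist: project_b @ file:///dbfs/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/project_b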
DBX allows me to use a relative path when I deploy a job, like this:
environments:
  local:
    workflows:
      - name: dbx_execute_job
        spark_python_task:
          python_file: file://main.py
          parameters:
            - '--config'
            - 'file:fuse://conf/jobs/config.yaml'
Where "file://main.py" will point to "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/main.py"
There's also a notable difference between the path I give in the deployment.yml file (file://, with only two slashes) and how Poetry writes it in the wheel metadata (file:///project_b, with three slashes).
I don't know if what I'm trying to do is achievable, but in the end I would like to deploy a Python application containing a submodule without having to list all of Project B's libraries in Project A's pyproject.toml file.
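To make the goal concrete, what I want to avoid is flattening everything by hand into Project A's pyproject.toml, i.e. something like this (library names are placeholders again):

[tool.poetry.dependencies]
python = "^3.10"
dbx = "^0.8.15"
# every library Project B needs, duplicated manually
pandas = "^2.0"
requests = "^2.31"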
I would appreciate any help!
Thank you