
Deploy Python application with submodules - Poetry library management

57410
New Contributor

Hi,

I'm using DBX (I'll soon move to Databricks Asset Bundle, but it doesn't change anything in my situation) to deploy a Python application to Databricks. I'm also using Poetry to manage my libraries and dependencies.

My project looks like this:

Project A
├── Folders A
├── main.py
├── pyproject.toml
└── project_b (Project B)
    ├── Folders B
    ├── main.py
    └── pyproject.toml

Project B is a submodule with its own libraries and dependencies. To avoid duplicating declarations, and to avoid managing libraries in Project A that are only used by Project B, I declare Project B as a path dependency in Project A's pyproject.toml file.

Project A's pyproject.toml file:

[tool.poetry.dependencies]
python = "^3.10"
dbx = "^0.8.15"
project_b = { path = "./project_b", develop = true }

By doing so, Project A's poetry.lock file includes the libraries defined in its own pyproject.toml, plus all the ones pulled in transitively from Project B.

To deploy my code to Databricks, DBX builds a wheel file whose METADATA contains:

Metadata-Version: 2.1
Name: project_a
Version: 1.0.0
Summary: some desc
Author: me
Author-email: me@me.com
Requires-Python: >=3.10,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: dbx (>=0.8.15,<0.9.0)
Requires-Dist: project_b @ file:///project_b

We can see dbx and project_b listed as defined in Project A's pyproject.toml; note that project_b is recorded as a direct file:/// URL reference.

The job (deployed with DBX using a deployment.yml file) then fails on Databricks with the following error message:

24/04/08 14:27:18 WARN LibraryState: [Thread 168] Failed to install library dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/dist/project_a-1.0.0-py3-none-any.whl
org.apache.spark.SparkException: Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh /local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install --upgrade /local_disk0/tmp/addedFile556d8bb0a3244f44834308af3f689c807372305198373422453/project_a-1.0.0-py3-none-any.whl --disable-pip-version-check) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/project_b'

My assumption is that Databricks doesn't know how to dynamically resolve the full path of the submodule referenced by the wheel. In this case, it should point to "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/project_b".

DBX allows me to use a relative path when I deploy a job, like this:

environments:
  local:
    workflows:
      - name: dbx_execute_job
        spark_python_task:
          python_file: file://main.py
          parameters:
            - '--config'
            - 'file:fuse://conf/jobs/config.yaml'

Where "file://main.py" will point to "dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/main.py"
There's also a significant difference between the path I give on the deployment.yml file (with file://, only 2 slashes) and how poetry deals with it (file:///project_b, with 3 slashes).
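
To make the slash difference concrete, standard URL parsing (plain Python, nothing DBX- or Poetry-specific) reads the two forms very differently, which would explain why pip ends up looking for the literal path /project_b:

from urllib.parse import urlparse

# Three slashes: empty host + absolute path. pip will try to install
# from /project_b on whatever machine runs the install.
print(urlparse("file:///project_b").path)   # -> '/project_b'

# Two slashes: "main.py" lands in the host slot, not the path. DBX
# rewrites its file:// references at deploy time; pip does no such thing.
print(urlparse("file://main.py").netloc)    # -> 'main.py'
print(urlparse("file://main.py").path)      # -> ''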

I don't know if what I'm trying to do is achievable, but in the end I would like to be able to deploy a Python application, with a submodule in it, without listing all the libraries from Project B in Project A's pyproject.toml file.

I would appreciate any help!

Thank you

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @57410, it seems you're transitioning from DBX to Databricks Asset Bundles (DABs) for managing your complex data, analytics, and ML projects on the Databricks platform.

Let’s dive into the details and address the issue you’re facing.

Databricks Asset Bundles (DABs)

DABs are a powerful tool for streamlining the development of intricate projects. They provide CI/CD capabilities in your software development workflow using a concise and declarative YAML syntax. By automating tests, deployments, and configuration management, DABs reduce errors and promote best practices across your organization.

Here’s how DABs work (a minimal bundle file is sketched after the list below):

  1. Metadata and Source Files: A bundle includes the following components:

    • Required cloud infrastructure and workspace configurations
    • Source files (such as notebooks and Python files) containing business logic
    • Definitions and settings for Databricks resources (jobs, pipelines, endpoints, etc.)
    • Unit tests and integration tests
  2. Ideal Scenarios for DABs:

    • Team-Based Development: Use DABs to manage complex projects collaboratively.
    • ML Iteration: Streamline ML pipeline resources (training, inference jobs) following production best practices.
    • Standardization: Set organizational standards by creating custom bundle templates with default permissions, service principals, and CI/CD configurations.
    • Regulatory Compliance: Maintain a versioned history of code and infrastructure for governance and compliance.
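
For concreteness, here is a minimal databricks.yml sketch; the job name, cluster ID, and workspace host below are placeholders, not values from your project:

# Minimal bundle definition (placeholder names and IDs throughout).
bundle:
  name: project_a

targets:
  dev:
    workspace:
      host: https://my-workspace.cloud.databricks.com  # placeholder host

resources:
  jobs:
    project_a_job:
      name: project_a_job
      tasks:
        - task_key: main
          existing_cluster_id: "1234-567890-abcdefgh"  # placeholder cluster
          spark_python_task:
            python_file: ./main.py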

Your Scenario

In your case, transitioning from DBX to DABs is a smart move. Let’s address the issue you encountered during deployment:

  1. Error Message:

    WARN LibraryState: [Thread 168] Failed to install library dbfs:/FileStore/my_location/4a4d6b50a44742d9be58fc544f272fd0/artifacts/dist/project_a-1.0.0-py3-none-any.whl
    org.apache.spark.SparkException: Process List(...) exited with code 1. ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/project_b'
    
  2. Possible Causes:

    • Project A's wheel metadata records the submodule as a direct URL dependency (Requires-Dist: project_b @ file:///project_b). When pip runs on the cluster, it looks for the literal path /project_b on the cluster's filesystem, which only existed on your build machine.
    • This is independent of the DBX-to-DABs transition: the file:/// reference is baked into the wheel by Poetry at build time, so switching tools alone won't change it.
  3. Next Steps:

    • Decouple the Builds: Build Project B into its own wheel (poetry build inside project_b) and attach both wheels to the job, so the installer never has to resolve file:///project_b at install time.
    • Clean Up the Metadata: Keep the path dependency out of Project A's published wheel metadata, for example by moving project_b to a Poetry dependency group that is only installed locally, and double-check the remaining dependencies (such as dbx) in your bundle configuration; a bundle sketch covering both wheels follows this list.
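
As a starting point, here is a hedged sketch of a bundle configuration that builds and attaches both wheels; the artifact keys, job name, and cluster ID are placeholders, and it assumes the file:/// path dependency has been removed from Project A's published metadata as described above:

# Sketch only: placeholder names throughout; assumes each project
# builds its wheel with "poetry build" into its own dist/ folder.
artifacts:
  project_a_whl:
    type: whl
    build: poetry build
    path: .
  project_b_whl:
    type: whl
    build: poetry build
    path: ./project_b

resources:
  jobs:
    project_a_job:
      name: project_a_job
      tasks:
        - task_key: main
          existing_cluster_id: "1234-567890-abcdefgh"  # placeholder
          spark_python_task:
            python_file: ./main.py
          libraries:
            # Attach both wheels; pip resolves project_b from the wheel
            # instead of the file:///project_b path baked in by Poetry.
            - whl: ./project_b/dist/*.whl
            - whl: ./dist/*.whl

With both wheels attached as task libraries, the cluster never needs to resolve a local filesystem path at install time.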

Remember, DABs simplify project management, enhance collaboration, and ensure consistency. If you encounter further issues, feel free to ask for assistance! 🚀

 