<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Do I need many wheels for each job in project? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/114222#M44752</link>
    <description>&lt;P&gt;Hi&amp;nbsp;kmodelew,&lt;/P&gt;&lt;P&gt;This is a common point of confusion when using Databricks Asset Bundles (DAB) with multiple task groups and a shared codebase. DAB builds one .whl file for the entire bundle, containing all of the packages under src/; it does not build a separate wheel per task group. So when your YAML asks for a wheel named task_group2, Databricks can’t find it: the wheel is named after your top-level project, not after the individual packages.&lt;/P&gt;&lt;P&gt;To fix this, reference the top-level package name (the one matching your wheel) in the package_name field of each job in your .yml files, and make sure setup.py includes all sub-packages via find_packages(), as you are already doing. So instead of setting package_name: task_group1 or task_group2, use the actual package name defined in setup.py (e.g. my_project), and in each job point entry_point at the correct function under that namespace (e.g. my_project.task_group1.main). That should resolve the “wheel not found” error and let all task groups run off the same wheel file. Let me know if you want help adjusting the setup.py or yml; happy to take a look!&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
    <pubDate>Wed, 02 Apr 2025 02:57:52 GMT</pubDate>
    <dc:creator>Brahmareddy</dc:creator>
    <dc:date>2025-04-02T02:57:52Z</dc:date>
    <item>
      <title>Do I need many wheels for each job in project?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/114207#M44749</link>
      <description>&lt;P&gt;I have a project with my commons, such as a SparkSession object (so I can run the code in PyCharm via the databricks-connect library and run the same code directly on Databricks). Under src I have a few packages from which DAB creates separate jobs. I'm using PyCharm. The structure of my project is as follows:&lt;/P&gt;&lt;P&gt;src/task_group1/&amp;lt;many_python_tasks&amp;gt;&lt;BR /&gt;src/task_group2/&amp;lt;many_python_tasks&amp;gt;&lt;/P&gt;&lt;P&gt;resources/task_group1.yml #tasks and job structure&lt;BR /&gt;resources/task_group2.yml #tasks and job structure&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;tasks:&lt;BR /&gt;  - task_key: main_task&lt;BR /&gt;    job_cluster_key: job_cluster&lt;BR /&gt;    python_wheel_task:&lt;BR /&gt;      package_name: task_group1&lt;BR /&gt;      entry_point: main&lt;BR /&gt;    libraries:&lt;BR /&gt;      # By default we just include the .whl file generated for the bundle_test package.&lt;BR /&gt;      # See https://docs.databricks.com/dev-tools/bundles/library-dependencies.html&lt;BR /&gt;      # for more information on how to add other libraries.&lt;BR /&gt;      - whl: ../dist/*.whl&lt;/PRE&gt;&lt;/DIV&gt;&lt;P&gt;After running on Databricks I get this error:&amp;nbsp;run failed with error message Python wheel with name task_group2 could not be found. Please check the driver logs for more details&lt;/P&gt;&lt;P&gt;Should Databricks Asset Bundles generate many wheel files, one *.whl file for each job? The one wheel generated by DAB has all packages included. Or is it a matter of wrong references in the yml files and setup.py?&lt;/P&gt;&lt;P&gt;setup.py with correct entry points:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;packages=find_packages(where="./src"),&lt;BR /&gt;package_dir={"": "src"},&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 01 Apr 2025 20:41:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/114207#M44749</guid>
      <dc:creator>kmodelew</dc:creator>
      <dc:date>2025-04-01T20:41:25Z</dc:date>
    </item>
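As a quick check of the layout described above: find_packages() discovers every directory under src/ that contains an __init__.py, so both task groups land in a single wheel. A minimal sketch (the throwaway directory tree here is an assumption mirroring the described structure):

```python
import os
import tempfile

from setuptools import find_packages

# Build a throwaway src/ tree mirroring the project layout from the post.
root = tempfile.mkdtemp()
for pkg in ("task_group1", "task_group2"):
    os.makedirs(os.path.join(root, "src", pkg))
    # An __init__.py is what makes a directory a package for find_packages().
    open(os.path.join(root, "src", pkg, "__init__.py"), "w").close()

# Both packages are found, and both end up in the ONE wheel DAB builds.
print(sorted(find_packages(where=os.path.join(root, "src"))))
# prints ['task_group1', 'task_group2']
```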
    <item>
      <title>Re: Do I need many wheels for each job in project?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/114222#M44752</link>
      <description>&lt;P&gt;Hi&amp;nbsp;kmodelew,&lt;/P&gt;&lt;P&gt;This is a common point of confusion when using Databricks Asset Bundles (DAB) with multiple task groups and a shared codebase. DAB builds one .whl file for the entire bundle, containing all of the packages under src/; it does not build a separate wheel per task group. So when your YAML asks for a wheel named task_group2, Databricks can’t find it: the wheel is named after your top-level project, not after the individual packages.&lt;/P&gt;&lt;P&gt;To fix this, reference the top-level package name (the one matching your wheel) in the package_name field of each job in your .yml files, and make sure setup.py includes all sub-packages via find_packages(), as you are already doing. So instead of setting package_name: task_group1 or task_group2, use the actual package name defined in setup.py (e.g. my_project), and in each job point entry_point at the correct function under that namespace (e.g. my_project.task_group1.main). That should resolve the “wheel not found” error and let all task groups run off the same wheel file. Let me know if you want help adjusting the setup.py or yml; happy to take a look!&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Brahma&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 02:57:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/114222#M44752</guid>
      <dc:creator>Brahmareddy</dc:creator>
      <dc:date>2025-04-02T02:57:52Z</dc:date>
    </item>
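Following the suggestion above, a sketch of one corrected task definition; my_project stands in for whatever name= the project's setup.py actually declares, and main for an entry point it defines, so both names are assumptions:

```yaml
tasks:
  - task_key: main_task
    job_cluster_key: job_cluster
    python_wheel_task:
      package_name: my_project   # must match name= in setup.py, not a sub-package
      entry_point: main          # an entry point declared in setup.py
    libraries:
      - whl: ../dist/*.whl       # the single wheel DAB builds for the bundle
```

With this shape, task_group1.yml and task_group2.yml can both point at the same wheel; only entry_point (and task_key) differs between them.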
    <item>
      <title>Re: Do I need many wheels for each job in project?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/115145#M45030</link>
      <description>&lt;P&gt;Hi, I hope this is useful. Here are my files:&lt;/P&gt;&lt;P&gt;project structure -&amp;gt; DAB_project_structure.png&lt;/P&gt;&lt;P&gt;each yml file for job definitions -&amp;gt; task_group_1_job.png and task_group_2_job.png&lt;/P&gt;&lt;P&gt;Each .py file has a main() method.&lt;/P&gt;&lt;P&gt;setup.py:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;description="wheel file based on bundle_test/src",&lt;BR /&gt;packages=find_packages(where="./src"),&lt;BR /&gt;package_dir={"": "src"},&lt;BR /&gt;entry_points={&lt;BR /&gt;    "packages": [&lt;BR /&gt;        "task_group_1_task_1=bundle_test.task_group_1.task_group_1_task_1:main",&lt;BR /&gt;        "task_group_2_task_2=bundle_test.task_group_2.task_group_1_task_2:main",&lt;BR /&gt;    ],&lt;BR /&gt;},&lt;BR /&gt;install_requires=[&lt;BR /&gt;    # Dependencies in case the output wheel file is used as a library dependency.&lt;BR /&gt;    # For defining dependencies, when this package is used in Databricks, see:&lt;BR /&gt;    # https://docs.databricks.com/dev-tools/bundles/library-dependencies.html&lt;BR /&gt;    "setuptools"&lt;BR /&gt;],&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 10 Apr 2025 09:35:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-i-need-many-wheels-for-each-job-in-project/m-p/115145#M45030</guid>
      <dc:creator>kmodelew</dc:creator>
      <dc:date>2025-04-10T09:35:46Z</dc:date>
    </item>
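The entry_points block in the setup.py above declares names that map to "module:function" strings. A sketch of how such a name resolves to a callable via the standard importlib.metadata machinery (the target value here uses json:dumps as a stand-in, since bundle_test is not installed in this sketch; exactly how Databricks performs the lookup internally is an assumption):

```python
from importlib.metadata import EntryPoint

# A hand-built entry point with the same shape as
# "task_group_1_task_1=bundle_test.task_group_1.task_group_1_task_1:main".
# json:dumps is a stand-in target, since bundle_test is not installed here.
ep = EntryPoint(name="task_group_1_task_1", value="json:dumps", group="packages")

func = ep.load()  # imports the module ("json") and returns the attribute (dumps)
print(func({"task": "ok"}))  # prints {"task": "ok"}
```

Note that the entry_point field of a python_wheel_task refers to the name on the left of the equals sign (e.g. task_group_1_task_1), not to a module path.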
  </channel>
</rss>

