Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to specify entry_point for python_wheel_task?

jpwp
New Contributor III

Can someone provide me an example for a python_wheel_task and what the entry_point field should be?

The jobs UI help popup says this about "entry_point":

"Function to call when starting the wheel, for example: main. If the entry point does not exist in the meta-data of the wheel distribution, the function will be called directly using `$packageName.$entryPoint()`."

However, an entry point in Python is a combination of a group and a name, e.g. in my setup.py:

entry_points = {
    'my_jobs': [
        'a_job = module_name:job_function'
    ]
},

Here the group is "my_jobs" and the name is "a_job". For Databricks, should I make entry_point `a_job`, `my_jobs.a_job`, or does Databricks require a specific group name for wheels that are run as tasks?

I couldn't find any documentation online to clarify this.


14 REPLIES

Kaniz_Fatma
Community Manager

Hi @Joel Pitt! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise I will get back to you soon. Thanks.

Kaniz_Fatma
Community Manager

Hi @Joel Pitt,

I believe this should be achievable by specifying the libraries field (see docs).

Can you try something like this?:

{
  "existing_cluster_id": <cluster_id>,
  "python_wheel_task": {
    "package_name": <package_name>,
    "entry_point": <entry_point>
  },
  "libraries": [
    { "whl": "dbfs:/FileStore/my-lib.whl" }
  ]
}

jpwp
New Contributor III

Hi Kaniz - I'm afraid that doesn't answer the question. I am asking about the expected value for the entry_point field. I am not trying to use an additional library, I am trying to run a python_wheel_task.

Kaniz_Fatma
Community Manager

Hi @Joel Pitt, please go through this documentation. Let me know if this helps.


Anonymous
Not applicable

@Joel Pitt - Let us know if either of Kaniz's resources helps you. If they do, would you be happy to mark that answer as best? That helps other members find solutions more quickly.

jpwp
New Contributor III

The correct answer to my question is that "entry_point" in the Databricks API has nothing to do with a Python wheel's official entry points. It is just a dotted Python path to a Python function, e.g. `mymodule.myfunction`.
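To illustrate the simplest case consistent with the `$packageName.$entryPoint()` behaviour quoted from the UI help text (a minimal sketch with hypothetical names): if the wheel installs a package `mypackage` that exposes a function `main` at its top level, then `package_name` is `mypackage` and `entry_point` is `main`.

# mypackage/__init__.py -- hypothetical package shipped in the wheel

def main():
    # Databricks effectively calls mypackage.main() when the task starts
    print("python_wheel_task started")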

Kaniz_Fatma
Community Manager

Hi @jpwp, I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.

hectorfi
New Contributor III

Just in case anyone comes here in the future, this is roughly how Databricks executes these entry points... How do I know? I have been banging my head against this wall for a couple of hours already.

from importlib import import_module, metadata

package_name = "some.package"
entry_point = "my-entry-point"

# Look up the entry points registered in the wheel's metadata
available_entry_points = metadata.distribution(package_name).entry_points
entry = [ep for ep in available_entry_points if ep.name == entry_point]

if entry:
    # The entry point exists in the metadata: load it and call it
    entry[0].load()()
else:
    # Fallback: import the package and call the attribute directly,
    # i.e. the `$packageName.$entryPoint()` behaviour from the UI help text
    module = import_module(package_name)
    getattr(module, entry_point)()

If you cannot see your entry point using the following, then you (we) are out of luck.

from importlib import metadata
from pprint import pprint

package_name = "my.package"
pprint(metadata.distribution(package_name).entry_points)
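With the setup.py from the original question installed, you would expect output along the lines of `[EntryPoint(name='a_job', value='module_name:job_function', group='my_jobs')]` (the exact repr varies across Python versions).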

My current working theory is that the user installing the package is not the same user running the job, or that for some weird reason the metadata is not available at job runtime...

GabMorin
New Contributor III

How do you build the wheel? I got it working with poetry like so:

entrypoint.py somewhere in your codebase:

def entrypoint():
    print("Works")

pyproject.toml:

[tool.poetry]
name = "package"
version = "1.0.0"
description = "package"
packages = [{include = "src"}, ]  # assuming you have the /src structure

[tool.poetry.scripts]
my_entrypoint = "src.entrypoint:entrypoint"  # before the colon is the path to the file; after it is the method name

Job config:
  package_name: 'package' --> taken from pyproject.toml
  entry_point: 'my_entrypoint' --> taken from pyproject.toml; the part before the `=` of your entry-point line

(assuming you installed the wheel on the cluster)
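For reference, the corresponding Jobs API fragment would look roughly like this (a sketch following the JSON shape shown earlier in the thread; the wheel path and filename are hypothetical):

{
  "python_wheel_task": {
    "package_name": "package",
    "entry_point": "my_entrypoint"
  },
  "libraries": [
    { "whl": "dbfs:/FileStore/package-1.0.0-py3-none-any.whl" }
  ]
}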

I also pulled my hair out over this and am now bald.


FYI, my full setup is micromamba -> poetry -> gitlab -> pulumi -> databricks ☠️

hectorfi
New Contributor III

One thing to note when working with entry points is that if the name is too long, it may not work on Databricks. That was the cause of my issue.

Hi @hectorfi, it's great to hear that your query has been successfully resolved. Thank you for your contribution.

VictorS
New Contributor II

You are a hero for supplying a full example - especially the validation part is great. Thanks dude!

MRMintechGlobal
New Contributor II

Just want to confirm - my project uses PDM, not poetry,

and as such uses

[project.entry-points.packages]

Rather than

[tool.poetry.scripts]

and the bundle is failing to run on the cluster - as it can't find the entry point - is this expected behavior?

My issue appears to have been uploading wheels with identical version numbers during development.

I've added dynamic versioning to the packages, using the git hash and a timestamp, to ensure the latest build is installed and run.

import time

__version__ = "1.0.0"  # base version; in practice defined elsewhere in the package


def get_version(version=__version__):
    # Append the short git hash and a timestamp so every dev build is unique
    try:
        import subprocess

        git_hash = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"]).decode("ascii").strip()
        version += f"+{git_hash}-{int(time.time())}"
    except Exception as e:
        print(e)
    return version
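A sketch of how that might be wired into a setuptools build (assuming a setup.py based build; the package and module names here are hypothetical):

from setuptools import find_packages, setup

from mypackage._version import get_version  # hypothetical location of get_version

setup(
    name="mypackage",
    version=get_version(),
    packages=find_packages(),
)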

 
