<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Databricks job cli in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18568#M12343</link>
    <description>&lt;P&gt;An example:&lt;/P&gt;&lt;P&gt;The package name is test_wheel and the entry point is hello_world.&lt;/P&gt;&lt;P&gt;The package name refers to the folder in my project that contains the __init__.py, and the entry point is the method to call.&lt;/P&gt;&lt;P&gt;code.py (under test_wheel) contains a method named hello_world which just prints helloWorld. We import hello_world in __init__.py so that it is available at the root of the package.&lt;/P&gt;&lt;P&gt;In setup.py we include the test_wheel package. After building it, we upload the wheel as part of the job task. The job will print "helloWorld" in its logs.&lt;/P&gt;&lt;P&gt;In your case, in setup.py you could add test_wheel.code:hello_world for entry points.&lt;/P&gt;</description>
    <pubDate>Wed, 08 Jun 2022 17:53:37 GMT</pubDate>
    <dc:creator>Vivian_Wilfred</dc:creator>
    <dc:date>2022-06-08T17:53:37Z</dc:date>
    <item>
      <title>Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18563#M12338</link>
      <description>&lt;P&gt;Hey guys,&lt;/P&gt;&lt;P&gt;I'm trying to create a job via the Databricks CLI. The job uses a wheel file that I have already uploaded to DBFS, and from this package I exported the entry point needed for the job.&lt;/P&gt;&lt;P&gt;In the UI I can see that the job has been created, but when I try to run it I get an error saying that I need Manage access in order to install libraries on a cluster (cluster-scoped libraries).&lt;/P&gt;&lt;P&gt;My questions are:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Is there a way to create a job via the Databricks CLI and install packages in a notebook-scoped manner (without needing Manage access on the cluster)?&lt;/LI&gt;&lt;LI&gt;If, instead of using an existing cluster, I create a new cluster while creating the job, should there be any problem installing libraries that way?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;My job_config.json file:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
  "name": "test_databricks_cli_jobs",
  "tasks": [
    {
      "task_key": "Test_train_entrypoint",
      "description": "test print in train entrypoint",
      "depends_on": [],
      "existing_cluster_id": "Myicluster-id",
      "python_wheel_task": {
        "package_name": "testpack",
        "entry_point": "train",
        "parameters": ["Random", "This is a test message"]
      },
      "libraries": [
        {"whl": "/dbfs/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl"}
      ]
    }
  ]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Command to deploy the job:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;databricks jobs create --json-file job_config.json --version=2.1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Hope someone can help me.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2022 15:28:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18563#M12338</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-06-06T15:28:26Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18565#M12340</link>
      <description>&lt;P&gt;Hey Kaniz, thanks for your answer.&lt;/P&gt;&lt;P&gt;I'm not sure you understood me, so I will try to make it clearer.&lt;/P&gt;&lt;P&gt;I'm trying to automate the ML training process for our developers using Databricks.&lt;/P&gt;&lt;P&gt;When a developer finishes their code, I package it into a wheel file and upload it to the Databricks file system. This package has an entry point, let's call it train for now.&lt;/P&gt;&lt;P&gt;I managed to create a job using the CLI with all the needed configuration, but when I try to run the job I get an error: cluster Manage access is needed to install cluster libraries.&lt;/P&gt;&lt;P&gt;All the code in the wheel file is .py files.&lt;/P&gt;&lt;P&gt;In the job_config.json file I declared the libraries needed for the job to run (i.e. the wheel file that has already been uploaded).&lt;/P&gt;&lt;P&gt;Is there a way to run the job without getting the Manage access error, i.e. to install the library just for the job scope and not for the cluster (like a notebook-scoped library)?&lt;/P&gt;&lt;P&gt;Hope it's clearer now. If not, let me know and I will try to explain better.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 14:42:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18565#M12340</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-06-07T14:42:33Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18566#M12341</link>
      <description>&lt;P&gt;Hi @orian hindi, adding the wheel package to the "libraries" section of the json file will always try to install the whl at the cluster level, which requires Manage access, irrespective of whether it is a job cluster or an existing interactive cluster. You cannot achieve it this way without the necessary permission on the cluster.&lt;/P&gt;&lt;P&gt;Have you tried installing the whl directly in the code/notebook that is attached to the job run?&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/libraries/notebooks-python-libraries.html#install-a-package-from-dbfs-with-pip" alt="https://docs.databricks.com/libraries/notebooks-python-libraries.html#install-a-package-from-dbfs-with-pip" target="_blank"&gt;https://docs.databricks.com/libraries/notebooks-python-libraries.html#install-a-package-from-dbfs-with-pip&lt;/A&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%pip install /dbfs/mypackage-0.0.1-py3-none-any.whl&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This will install the library just for the job run scope and not on the cluster.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Jun 2022 14:36:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18566#M12341</guid>
      <dc:creator>Vivian_Wilfred</dc:creator>
      <dc:date>2022-06-08T14:36:51Z</dc:date>
    </item>
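    <!--
      A minimal sketch of the notebook-scoped install Vivian suggests, assuming the job runs a
      notebook task, and reusing the wheel path and names from the job_config.json earlier in
      this thread (testpack, train, and the "Random" / "This is a test message" parameters).

      # first cell of the notebook attached to the job run: installs the wheel for this run only,
      # so no cluster-level Manage access is needed
      %pip install /dbfs/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl

      # a later cell: import the package and call the function directly
      import testpack
      testpack.train("Random", "This is a test message")
    -->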
    <item>
      <title>Re: Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18567#M12342</link>
      <description>&lt;P&gt;Hey Vivian, thanks for the answer.&lt;/P&gt;&lt;P&gt;I got permission to create clusters for now. Instead of using an existing cluster, each job will be linked to a new cluster for its run, which solves the permission problem of installing libraries on the cluster (in config_job.json, instead of existing_cluster_id I passed the cluster spec to the new_cluster key).&lt;/P&gt;&lt;P&gt;After I managed to install the library I faced a new problem, and you might be able to help me with that.&lt;/P&gt;&lt;P&gt;I set an entry point named train; train is a function in my package that takes 2 params (name, message).&lt;/P&gt;&lt;P&gt;Does this entry point need to be set via the setup.py entry_points field, or should I export the function from my __init__ module, i.e. from .file import train?&lt;/P&gt;&lt;P&gt;When I try to export a function that doesn't take any params, it works fine just by exporting the function in the __init__ file (from .file import print_name).&lt;/P&gt;&lt;P&gt;Hope I explained my problem and you can help me.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 08 Jun 2022 14:48:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18567#M12342</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-06-08T14:48:30Z</dc:date>
    </item>
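    <!--
      A sketch of the new_cluster approach Orianh describes: in the job JSON, the task's
      existing_cluster_id is replaced by a new_cluster spec so the wheel is installed on a fresh
      job cluster. The spark_version, node_type_id and num_workers values below are placeholders,
      not taken from the thread; the rest mirrors the job_config.json from the original post.

      {
        "name": "test_databricks_cli_jobs",
        "tasks": [
          {
            "task_key": "Test_train_entrypoint",
            "new_cluster": {
              "spark_version": "10.4.x-scala2.12",
              "node_type_id": "i3.xlarge",
              "num_workers": 1
            },
            "python_wheel_task": {
              "package_name": "testpack",
              "entry_point": "train",
              "parameters": ["Random", "This is a test message"]
            },
            "libraries": [
              {"whl": "/dbfs/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl"}
            ]
          }
        ]
      }
    -->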
    <item>
      <title>Re: Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18568#M12343</link>
      <description>&lt;P&gt;An example:&lt;/P&gt;&lt;P&gt;The package name is test_wheel and the entry point is hello_world.&lt;/P&gt;&lt;P&gt;The package name refers to the folder in my project that contains the __init__.py, and the entry point is the method to call.&lt;/P&gt;&lt;P&gt;code.py (under test_wheel) contains a method named hello_world which just prints helloWorld. We import hello_world in __init__.py so that it is available at the root of the package.&lt;/P&gt;&lt;P&gt;In setup.py we include the test_wheel package. After building it, we upload the wheel as part of the job task. The job will print "helloWorld" in its logs.&lt;/P&gt;&lt;P&gt;In your case, in setup.py you could add test_wheel.code:hello_world for entry points.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Jun 2022 17:53:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18568#M12343</guid>
      <dc:creator>Vivian_Wilfred</dc:creator>
      <dc:date>2022-06-08T17:53:37Z</dc:date>
    </item>
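    <!--
      A minimal layout matching the example above, assuming a plain setuptools build. The names
      (test_wheel, code.py, hello_world) come from the post; the version number and the build
      command are assumptions.

      # test_wheel/code.py
      def hello_world():
          print("helloWorld")

      # test_wheel/__init__.py
      from .code import hello_world   # expose the entry point at the root of the package

      # setup.py
      import setuptools

      setuptools.setup(
          name="test_wheel",
          version="0.0.1",
          packages=["test_wheel"],
      )

      # Build the wheel (for example with "python setup.py bdist_wheel") and upload it as the
      # library of the python_wheel_task, with package_name test_wheel and entry_point hello_world.
      # In setup.py an explicit entry point such as test_wheel.code:hello_world could also be
      # declared, as suggested in the post.
    -->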
    <item>
      <title>Re: Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18569#M12344</link>
      <description>&lt;P&gt;Hey @Vivian Wilfred,&lt;/P&gt;&lt;P&gt;When I set a function without any parameters as the entry point, everything works.&lt;/P&gt;&lt;P&gt;I get an error when I try to set a function that takes params as the entry point.&lt;/P&gt;&lt;P&gt;How does Databricks pass those parameters to the entry point?&lt;/P&gt;&lt;P&gt;My package structure is minimal:&lt;/P&gt;&lt;P&gt;- testpack&lt;/P&gt;&lt;P&gt;-- __init__.py&lt;/P&gt;&lt;P&gt;-- main.py&lt;/P&gt;&lt;P&gt;Sharing some code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;__init__.py file:

from .main import train

---
main.py file:

def train(name, message):
  print(f"{name} said {message}")

---
In setup.py I tried to add entry_points in a few ways that didn't work (not sure I'm correct):

setuptools.setup(
    ...,
    entry_points={
        'train': [
            'train=testpack.main:train'   # also tried 'train=main:train'
        ]
    }
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Hope you can help me, thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 09 Jun 2022 08:30:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18569#M12344</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-06-09T08:30:56Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks job cli</title>
      <link>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18572#M12347</link>
      <description>&lt;P&gt;Hey Kaniz, sorry for the late response.&lt;/P&gt;&lt;P&gt;I think I figured it out: Databricks passes parameters to the entry point via the command line.&lt;/P&gt;&lt;P&gt;There are two ways to set an entry point for a job:&lt;/P&gt;&lt;P&gt;1) Using an entry point in setup.py, as Vivian mentioned in the answer above.&lt;/P&gt;&lt;P&gt;2) Exporting the function from the package's __init__ file (e.g. from .main import func).&lt;/P&gt;&lt;P&gt;In both approaches the entry-point function must not take any parameters; the parameters are passed via the command line, so you can read them using argparse or from sys.argv.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jun 2022 10:45:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databrikcs-job-cli/m-p/18572#M12347</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2022-06-13T10:45:16Z</dc:date>
    </item>
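    <!--
      A sketch of the zero-argument entry point described in the post above: the values from the
      job's "parameters" list arrive on the command line, so the function reads them itself
      instead of declaring (name, message) arguments. Whether the job's parameters start at
      sys.argv[1] is an assumption here; scanning sys.argv or using argparse is safer in practice.

      # testpack/main.py
      import sys

      def train():
          # e.g. the job's parameters: ["Random", "This is a test message"]
          args = sys.argv[1:]
          name, message = args[0], args[1]
          print(f"{name} said {message}")

      # testpack/__init__.py
      from .main import train   # expose the zero-argument entry point at the package root
    -->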
  </channel>
</rss>

