Databricks job CLI

Orianh
Valued Contributor II

Hey guys,

I'm trying to create a job via the Databricks CLI. The job uses a wheel file that I already uploaded to DBFS, and I exported from this package the entry point needed for the job.

In the UI I can see that the job has been created, but when I try to run it I get an error saying I need Manage access in order to install libraries on a cluster (cluster-scoped libraries).

My questions are -

  1. Is there a way to create a job via the Databricks CLI and install packages in a notebook-scoped manner (i.e., without needing Manage access on a cluster)?
  2. Let's say that instead of using an existing cluster, I create a new cluster while creating the job. Would there be any problems installing libraries this way?

My job_config.json file:

{
  "name": "test_databricks_cli_jobs",
  "tasks": [
    {
      "task_key": "Test_train_entrypoint",
      "description": "test print in train entrypoint",
      "depends_on": [],
      "existing_cluster_id": "Myicluster-id",
      "python_wheel_task": {
        "package_name": "testpack",
        "entry_point": "train",
        "parameters": ["Random", "This is a test message"]
      },
      "libraries": [
        {"whl": "/dbfs/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl"}
      ]
    }
  ]
}

Command to deploy the job:

databricks jobs create --json-file job_config.json --version=2.1

Hope someone can help me.

Thanks!


Kaniz
Community Manager
Community Manager
Hi @orian hindi, you run Databricks jobs CLI subcommands by appending them to databricks jobs, and Databricks job runs CLI subcommands by appending them to databricks runs. For Databricks job runs CLI subcommands, see the Runs CLI.

Important:

The Databricks jobs CLI supports calls to two versions of the Databricks Jobs REST API: versions 2.1 and 2.0. Version 2.1 supports orchestration of jobs with multiple tasks; see Workflows with assignments and Jobs API updates.

Databricks recommends that you call version 2.1 unless you have legacy scripts that rely on version 2.0 and cannot be migrated.

Unless otherwise specified, the programmatic behaviors described in this article apply equally to versions 2.1 and 2.0.

There are two methods for installing notebook-scoped libraries:

  • Run the %pip magic command in a notebook. The %pip command is supported on Databricks Runtime 7.1 and above and on Databricks Runtime 6.4 ML and above. Databricks recommends using this approach for new workloads. This article describes how to use these magic commands.
  • On Databricks Runtime 10.5 and below, you can use the Databricks library utility. The library utility is supported only on Databricks Runtime, not Databricks Runtime ML or Databricks Runtime for Genomics. See Library utility (dbutils.library).

To install libraries for all notebooks attached to a cluster, use workspace or cluster-installed libraries.

Important: the dbutils.library.install and dbutils.library.installPyPI APIs are removed in Databricks Runtime 11.0.
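For illustration, a minimal sketch of the two options above, run from a notebook cell attached to the cluster (the wheel path here is the one from the job config in this thread; adjust it to your own upload location):

%pip install /dbfs/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl

# Or, on runtimes where the library utility is still available:
dbutils.library.install("dbfs:/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl")
dbutils.library.restartPython()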

Orianh
Valued Contributor II

Hey Kaniz, Thanks for your answer.

I'm not sure you understood me, so I'll try to make it clearer.

I'm trying to automate the ML training process for our developers using Databricks.

When a developer finishes their code, I package it into a wheel file and upload it to the Databricks file system. This package has an entry point; let's call it train for now.

I managed to create a job using the CLI with all the needed configuration, but when I try to run the job I get an error: cluster Manage access is needed to install cluster libraries.

All the code in the wheel file is in .py files.

In the job_config.json file I declared the libraries needed for the job to run (i.e., the wheel file that was already uploaded).

Is there a way to run the job without getting the Manage access error, i.e., to install the library just for the job scope and not for the cluster (like a notebook-scoped library)?

Hope it's clearer now. If not, let me know and I'll try to explain better.

Vivian_Wilfred
Honored Contributor

Hi @orian hindi, adding the wheel package in the "libraries" section of the JSON file will always try to install the whl at the cluster level, which requires Manage access, irrespective of whether it is a job cluster or an existing interactive cluster. You cannot achieve it this way without the necessary permission on the cluster.

Have you tried installing the whl directly in your code/notebook that is attached to the job run?

https://docs.databricks.com/libraries/notebooks-python-libraries.html#install-a-package-from-dbfs-wi...

%pip install /dbfs/mypackage-0.0.1-py3-none-any.whl

This will install the library just for the job run scope and not on the cluster.

Orianh
Valued Contributor II

Hey Vivian, Thanks for the answer.

I got permission to create clusters for now. Instead of using an existing cluster, each job will be linked to a new cluster for its run, which solves the permission problem for installing libraries on the cluster (in job_config.json I passed a cluster spec to the new_cluster key instead of using existing_cluster_id). A sketch of that change is below.
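For reference, a minimal sketch of that change to the task in job_config.json; the spark_version and node_type_id values are illustrative placeholders, so use ones valid in your workspace:

{
  "task_key": "Test_train_entrypoint",
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 1
  },
  "python_wheel_task": {
    "package_name": "testpack",
    "entry_point": "train",
    "parameters": ["Random", "This is a test message"]
  },
  "libraries": [
    {"whl": "/dbfs/FileStore/jars/test/testpack-0.0.1-py3-none-any.whl"}
  ]
}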

After I managed to install the library, I faced a new problem that you might be able to help me with.

I set an entry point named train. train is a function in my package that takes two params (name, message).

Should this entry point be set with the setup.py entry_points field, or should I export the function inside my __init__ module (from .file import train)?

When I tried a function that doesn't take any params, it worked fine just by exporting the function in the __init__ file (from .file import print_name).

Hope I explained my problem well and you can help me,

Thanks!

An example:

Package name is test_wheel and the entry point is hello_world:

The package name refers to the folder in my project that contains the __init__.py and the entry point is the method to call.

code.py (under test_wheel) contains a method named hello_world which just prints helloWorld. We import hello_world in __init__.py so that it is available at the root of the package.

In setup.py we include the test_wheel package. After building it, we upload the wheel as part of the job task. The job will print "helloWorld" in its logs.

In your case, in setup.py you could add test_wheel.code:hello_world for entry points.
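A minimal sketch of the layout Vivian describes, using the package and entry-point names from the example (the entry-point group name "group_1" is just illustrative):

test_wheel/code.py:

def hello_world():
    # The job run should show this line in its output logs.
    print("helloWorld")

test_wheel/__init__.py:

from .code import hello_world

setup.py:

import setuptools

setuptools.setup(
    name="test_wheel",
    version="0.0.1",
    packages=setuptools.find_packages(),
    entry_points={
        "group_1": ["hello_world=test_wheel.code:hello_world"]
    }
)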

Orianh
Valued Contributor II

Hey @Vivian Wilfred,

When I set a function without any parameters as the entry point, everything works.

I get an error when I try to set a function with params as the entry point.

How does Databricks pass those parameters to the entry point?

My package structure is minimal:

testpack/
  __init__.py
  main.py

Sharing some code:

__init__.py file:
 
from .main import train
 
---
main.py file:
 
def train(name, message):
  print(f"{name} said {message}")
 
 
---
In setup.py I tried adding entry_points in a few ways that didn't work (not sure I'm correct):

setuptools.setup(
    ...,
    entry_points={
        'train': [
            'train=testpack.main:train'   # also tried 'train=main:train'
        ]
    }
)

Hope you can help me, Thanks.

Kaniz
Community Manager

Hi @orian hindi, we have a community thread with a similar issue to the one you've mentioned. Please let us know if it helps you find a solution. Thanks.

Kaniz
Community Manager

Hi @orian hindi, we haven't heard from you since my last response, and I was checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

Orianh
Valued Contributor II

Hey Kaniz, sorry for the late response.

I think I figured it out: Databricks passes parameters to the entry point via the command line.

There are two ways to set an entry point for a job:

1) Using an entry point in setup.py, like Vivian mentioned in the answer above.

2) Exporting the function from the __init__ file of the package (e.g. from .main import func).

In both approaches the function must not take any parameters; the parameters are passed via the command line, so you can read them with argparse or from sys.argv. A minimal sketch is below.
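To make that concrete, here is a minimal sketch based on the testpack package from this thread: the entry-point function takes no arguments itself and reads the job's "parameters" list from the command line (the entry-point group name "group_1" is illustrative):

main.py:

import argparse

def train():
    # The job's "parameters" list, e.g. ["Random", "This is a test message"],
    # is passed on the command line and ends up in sys.argv.
    parser = argparse.ArgumentParser()
    parser.add_argument("name")
    parser.add_argument("message")
    args = parser.parse_args()
    print(f"{args.name} said {args.message}")

setup.py:

import setuptools

setuptools.setup(
    name="testpack",
    version="0.0.1",
    packages=setuptools.find_packages(),
    entry_points={
        "group_1": ["train=testpack.main:train"]
    }
)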

Kaniz
Community Manager

Awesome, I'm glad to hear that.

@orian hindi, thank you for selecting the best answer.
