<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Azure Data Factory: allocate resources per Notebook in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33096#M1751</link>
    <description>&lt;P&gt;I understand that, in your case, auto-scaling would take too much time.&lt;/P&gt;&lt;P&gt;The simplest option is to use a separate cluster for each notebook (and make sure the previous cluster is terminated immediately).&lt;/P&gt;&lt;P&gt;Another option is to call the REST API 2.0/clusters/resize endpoint to resize the cluster: &lt;A href="https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can even do it directly from a notebook; here is a script that detects all the required parameters:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import requests

# Read the workspace host name, cluster id, and API token from the notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
domain_name = ctx.tags().get("browserHostName").get()
cluster_id = ctx.clusterId().get()
host_token = ctx.apiToken().get()

# Resize the current cluster to 2 workers
response = requests.post(
    f'https://{domain_name}/api/2.0/clusters/resize',
    headers={'Authorization': f'Bearer {host_token}'},
    json={'cluster_id': cluster_id, 'num_workers': 2},
)
response.raise_for_status()&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Tue, 30 Aug 2022 17:07:55 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2022-08-30T17:07:55Z</dc:date>
    <item>
      <title>Azure Data Factory: allocate resources per Notebook</title>
      <link>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33095#M1750</link>
      <description>&lt;P&gt;I'm using Azure Data Factory to create a pipeline of Databricks notebooks, &lt;/P&gt;&lt;P&gt;something like this:&lt;/P&gt;&lt;P&gt;[Notebook 1 - data pre-processing] -&amp;gt; [Notebook 2 - model training] -&amp;gt; [Notebook 3 - performance evaluation].&lt;/P&gt;&lt;P&gt;Can I write a config file that would allow me to allocate resources per notebook (brick)?&lt;/P&gt;&lt;P&gt;Suppose data pre-processing requires 40 workers, while performance evaluation can be done with only 1 worker.&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;P&gt;Giorgi&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2022 12:57:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33095#M1750</guid>
      <dc:creator>Giorgi</dc:creator>
      <dc:date>2022-08-30T12:57:32Z</dc:date>
    </item>
    <item>
      <title>Re: Azure Data Factory: allocate resources per Notebook</title>
      <link>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33096#M1751</link>
      <description>&lt;P&gt;I understand that, in your case, auto-scaling would take too much time.&lt;/P&gt;&lt;P&gt;The simplest option is to use a separate cluster for each notebook (and make sure the previous cluster is terminated immediately).&lt;/P&gt;&lt;P&gt;Another option is to call the REST API 2.0/clusters/resize endpoint to resize the cluster: &lt;A href="https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize" target="_blank"&gt;https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can even do it directly from a notebook; here is a script that detects all the required parameters:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import requests

# Read the workspace host name, cluster id, and API token from the notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
domain_name = ctx.tags().get("browserHostName").get()
cluster_id = ctx.clusterId().get()
host_token = ctx.apiToken().get()

# Resize the current cluster to 2 workers
response = requests.post(
    f'https://{domain_name}/api/2.0/clusters/resize',
    headers={'Authorization': f'Bearer {host_token}'},
    json={'cluster_id': cluster_id, 'num_workers': 2},
)
response.raise_for_status()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 30 Aug 2022 17:07:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33096#M1751</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-08-30T17:07:55Z</dc:date>
    </item>
    <item>
      <title>Re: Azure Data Factory: allocate resources per Notebook</title>
      <link>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33097#M1752</link>
      <description>&lt;P&gt;Thanks for your answer!&lt;/P&gt;&lt;P&gt;A different cluster per notebook does what I need, for now.&lt;/P&gt;&lt;P&gt;The REST API 2.0 solution for resizing clusters seems a more flexible way to go. I guess it should also be possible to create clusters on demand, from JSON configs, via a curl command?&lt;/P&gt;&lt;P&gt;Ideally, I'd like to deploy the ADF pipeline from code (JSON or Python), where I can configure the resources to be used at each step (on demand), as well as the package versions to be installed on each cluster.&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;&lt;P&gt;Giorgi&lt;/P&gt;</description>
      <pubDate>Wed, 31 Aug 2022 09:39:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/azure-data-factory-allocate-resources-per-notebook/m-p/33097#M1752</guid>
      <dc:creator>Giorgi</dc:creator>
      <dc:date>2022-08-31T09:39:43Z</dc:date>
    </item>
  </channel>
</rss>