Azure Data Factory: allocate resources per Notebook

Giorgi
New Contributor III

I'm using Azure Data Factory to create a pipeline of Databricks notebooks,

something like this:

[Notebook 1 - data pre-processing ] -> [Notebook 2 - model training ] -> [Notebook 3 - performance evaluation].

Can I write a config file that would allow me to allocate resources per notebook (brick)?

Suppose data pre-processing requires 40 workers, while performance evaluation can be done with only 1 worker.

Thank you!

Giorgi

1 ACCEPTED SOLUTION

Giorgi
New Contributor III

Thanks for your answer!

Different cluster per notebook does what I need, for now.

The solution with REST API 2.0 to resize the cluster seems a more flexible way to go. I guess it should be possible to create clusters on demand, from JSON configs, via a curl command?
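[Editor's note] A minimal sketch of that idea in Python (same `requests` approach as the notebook script below, rather than curl). The workspace domain and token are placeholders, and the field names come from the Clusters API 2.0 `create` endpoint:

```python
import requests

# Hypothetical workspace values -- substitute your own domain and PAT token.
DOMAIN = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

# Cluster spec, exactly as it could live in a JSON config file.
# Field names follow the Clusters API 2.0 create endpoint.
cluster_config = {
    "cluster_name": "preprocessing-cluster",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 40,
}

def create_cluster(config):
    """POST the config to /api/2.0/clusters/create and return the new cluster_id."""
    resp = requests.post(
        f"https://{DOMAIN}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=config,
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]
```

The same payload works with `curl -X POST -H "Authorization: Bearer $TOKEN" -d @config.json https://$DOMAIN/api/2.0/clusters/create`, so a JSON config per pipeline step is entirely feasible.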

Ideally, I'd like to achieve ADF pipeline deployment from code (JSON or Python), where I can configure the resources to be used at each step (on demand), and the package versions to be installed on each cluster.
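[Editor's note] One way to get per-step resources in ADF is a separate Databricks linked service per activity, each with its own new-job-cluster spec. A sketch of such a linked service definition (property names follow the ADF `AzureDatabricks` linked service schema; all values here are placeholders):

```json
{
  "name": "DatabricksPreprocessing",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
      "accessToken": { "type": "SecureString", "value": "<access token>" },
      "newClusterVersion": "10.4.x-scala2.12",
      "newClusterNodeType": "Standard_DS3_v2",
      "newClusterNumOfWorker": "40"
    }
  }
}
```

An evaluation step would point at a second linked service with `"newClusterNumOfWorker": "1"`; package versions can then be pinned per step via the `libraries` list on each Databricks Notebook activity.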

Thank you!

Giorgi


2 REPLIES

Hubert-Dudek
Esteemed Contributor III

I understand that, in your case, auto-scaling will take too much time.

The simplest option is to use a different cluster for each notebook (and make sure the previous cluster is terminated immediately).
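[Editor's note] To make sure the previous step's cluster stops billing as soon as its notebook finishes, it can be terminated explicitly. A minimal sketch, assuming a placeholder workspace domain and token (the endpoint name comes from the Clusters API 2.0; `delete` terminates the cluster without permanently removing it):

```python
import requests

# Hypothetical workspace values -- replace with your own.
DOMAIN = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

def terminate_cluster(cluster_id):
    """Terminate (not permanently delete) a cluster via /api/2.0/clusters/delete,
    so the previous pipeline step's workers are released immediately."""
    resp = requests.post(
        f"https://{DOMAIN}/api/2.0/clusters/delete",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"cluster_id": cluster_id},
    )
    resp.raise_for_status()
```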

Another option is to use the REST API 2.0 endpoint /clusters/resize to resize the cluster: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize

There is also a "magic" option to do it from within a notebook; the script below detects all the required parameters automatically.

import requests

# Pull the workspace domain, the current cluster's id, and an API token
# straight from the notebook context -- no hard-coded secrets needed.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
domain_name = ctx.tags().get("browserHostName").get()
cluster_id = ctx.clusterId().get()
host_token = ctx.apiToken().get()

# Ask the Clusters API 2.0 to resize the current cluster to 2 workers.
requests.post(
    f'https://{domain_name}/api/2.0/clusters/resize',
    headers={'Authorization': f'Bearer {host_token}'},
    json={"cluster_id": cluster_id, "num_workers": 2},
)

