Azure Data Factory: allocate resources per Notebook

Giorgi
Contributor

I'm using Azure Data Factory to create a pipeline of Databricks notebooks,

something like this:

[Notebook 1 - data pre-processing] -> [Notebook 2 - model training] -> [Notebook 3 - performance evaluation].

Can I write a config file that would allow me to allocate resources per notebook?

Suppose data pre-processing requires 40 workers, while performance evaluation can be done with only 1 worker.

Thank you!

Giorgi

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

I understand that, in your case, auto-scaling would take too much time.

The simplest option is to use a different cluster for each notebook (and make sure the previous cluster is terminated immediately).

Another option is to resize the cluster via the REST API 2.0 clusters/resize endpoint: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize

There is also a magic option to do it from a notebook; here is a script that detects all the required parameters.

import requests

# Read the notebook context to discover the workspace host name, the id of
# the cluster this notebook runs on, and an API token - no hard-coding needed.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
domain_name = ctx.tags().get("browserHostName").get()
cluster_id = ctx.clusterId().get()
host_token = ctx.apiToken().get()

# Resize the current cluster to 2 workers.
requests.post(
    f'https://{domain_name}/api/2.0/clusters/resize',
    headers={'Authorization': f'Bearer {host_token}'},
    json={'cluster_id': cluster_id, 'num_workers': 2}
)
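
For the first option, the previous cluster can be shut down through the same API when a notebook finishes. A minimal sketch, reusing domain_name, host_token, and cluster_id from the script above (note that clusters/delete terminates the cluster; it does not permanently delete it):

# Terminate the current cluster so the next pipeline step can start
# on its own, differently sized cluster.
requests.post(
    f'https://{domain_name}/api/2.0/clusters/delete',
    headers={'Authorization': f'Bearer {host_token}'},
    json={'cluster_id': cluster_id}
)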

Thanks for your answer!

A different cluster per notebook does what I need for now.

The REST API 2.0 solution to resize the cluster seems a more flexible way to go. I guess it should be possible to create clusters on demand, from JSON configs, via a curl command?
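
Something like this, I imagine - a sketch using Python requests like the script above (the same call works from curl); the workspace URL, token, cluster name, node type, and runtime version below are all placeholders:

import requests

domain_name = 'adb-1234567890123456.7.azuredatabricks.net'  # placeholder workspace URL
host_token = '<personal-access-token>'                      # placeholder Databricks PAT

# Hypothetical per-step cluster config; in practice this could be loaded
# from a JSON file kept alongside the pipeline definition.
cluster_config = {
    'cluster_name': 'preprocessing-cluster',  # illustrative name
    'spark_version': '10.4.x-scala2.12',      # any supported runtime version
    'node_type_id': 'Standard_DS3_v2',        # Azure VM type for the workers
    'num_workers': 40                         # e.g. 40 workers for pre-processing
}

resp = requests.post(
    f'https://{domain_name}/api/2.0/clusters/create',
    headers={'Authorization': f'Bearer {host_token}'},
    json=cluster_config
)
print(resp.json())  # on success, returns the new cluster's cluster_id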

Ideally, I'd like to achieve ADF pipeline deployment from code (JSON or Python), where I can configure the resources to be used at each step (on demand) and the package versions to be installed on each cluster.
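
On the per-step resources part: from what I can see in the ADF docs, each Databricks Notebook activity can reference its own linked service, and a linked service that defines a new job cluster carries its own worker count. A minimal sketch of such a linked service definition in JSON (the name, workspace URL, token reference, node type, and runtime version are placeholders):

{
  "name": "DatabricksPreprocessingLS",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
      "accessToken": { "type": "SecureString", "value": "<personal-access-token>" },
      "newClusterVersion": "10.4.x-scala2.12",
      "newClusterNodeType": "Standard_DS3_v2",
      "newClusterNumOfWorker": "40"
    }
  }
}

One such linked service per step (e.g. 40 workers for pre-processing, 1 for evaluation) would give each notebook its own cluster size.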

Thank you!

Giorgi
