08-30-2022 05:57 AM
I'm using Azure Data Factory to create a pipeline of Databricks notebooks,
something like this:
[Notebook 1 - data pre-processing] -> [Notebook 2 - model training] -> [Notebook 3 - performance evaluation]
Can I write a config file that would let me allocate resources per notebook?
Suppose data pre-processing requires 40 workers, while performance evaluation can be done with only 1 worker.
Thank you!
Giorgi
- Labels:
  - Databricks notebook
Accepted Solutions
08-31-2022 02:39 AM
Thanks for your answer!
A different cluster per notebook does what I need, for now.
The REST API 2.0 approach of resizing the cluster seems like a more flexible way to go. I guess it should be possible to create clusters on demand, from JSON configs, via a curl command?
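For what it's worth, a minimal sketch of that idea against the Clusters API 2.0 create endpoint (shown in Python, though the same JSON body would work with curl -X POST; the workspace URL, token, and cluster settings below are placeholders, not values from this thread):

import requests

# Placeholder workspace hostname and personal access token -- replace with your own.
DOMAIN = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# JSON config for an on-demand cluster; adjust node type and worker count per pipeline step.
cluster_config = {
    "cluster_name": "preprocessing-cluster",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 40,
}

# Create the cluster and print the id of the newly created cluster.
resp = requests.post(
    f"https://{DOMAIN}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_config,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])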
Ideally, I'd like to deploy the ADF pipeline from code (JSON or Python), where I can configure the resources to be used at each step (on demand), along with the package versions to be installed on each cluster.
Thank you!
Giorgi
08-30-2022 10:07 AM
I understand that, in your case, auto-scaling would take too much time.
The simplest option is to use a different cluster for each notebook (and make sure the previous cluster is terminated immediately). In ADF, each Notebook activity can point to its own Databricks linked service, each configured with an appropriately sized job cluster.
Another option is to use the REST API 2.0 /clusters/resize endpoint to resize the cluster: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize
There is also a magic option to do it from within a notebook; below is a script that detects all the required parameters automatically:
import requests

# Pull the workspace hostname, cluster id, and API token from the notebook context.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
domain_name = ctx.tags().get("browserHostName").get()
cluster_id = ctx.clusterId().get()
host_token = ctx.apiToken().get()

# Ask the Clusters API to resize the current cluster to 2 workers.
requests.post(
    f'https://{domain_name}/api/2.0/clusters/resize',
    headers={'Authorization': f'Bearer {host_token}'},
    json={'cluster_id': cluster_id, 'num_workers': 2}
)
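One thing to keep in mind: the resize call returns before the new workers are actually up, so if the next step depends on the extra capacity you may want to poll the cluster state first. A rough sketch, reusing the variables from the script above:

import time

# Poll the Clusters API until the cluster leaves the RESIZING state.
while True:
    state = requests.get(
        f'https://{domain_name}/api/2.0/clusters/get',
        headers={'Authorization': f'Bearer {host_token}'},
        params={'cluster_id': cluster_id},
    ).json()['state']
    if state != 'RESIZING':
        break
    time.sleep(10)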