Azure Data Factory: allocate resources per Notebook

Giorgi
New Contributor III

I'm using Azure Data Factory to create a pipeline of Databricks notebooks,

something like this:

[Notebook 1 - data pre-processing ] -> [Notebook 2 - model training ] -> [Notebook 3 - performance evaluation].

Can I write a config file that would allow me to allocate resources per notebook (brick)?

Suppose data pre-processing requires 40 workers, while performance evaluation can be done with only 1 worker.

Thank you!

Giorgi

1 ACCEPTED SOLUTION

Giorgi
New Contributor III

Thanks for your answer!

Different cluster per notebook does what I need, for now.

The solution with REST API 2.0 to resize the cluster seems a more flexible way to go. I guess it should be possible to create clusters on demand, from JSON configs, via a curl command?
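[Editor's note] A minimal sketch of that idea in Python (same `requests` approach as the notebook script below, rather than curl). The workspace domain and token are placeholders, and the field names come from the Clusters API 2.0 `create` endpoint:

```python
import requests

# Hypothetical workspace values -- substitute your own domain and PAT token.
DOMAIN = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

# Cluster spec, exactly as it could live in a JSON config file.
# Field names follow the Clusters API 2.0 create endpoint.
cluster_config = {
    "cluster_name": "preprocessing-cluster",
    "spark_version": "10.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 40,
}

def create_cluster(config):
    """POST the config to /api/2.0/clusters/create and return the new cluster_id."""
    resp = requests.post(
        f"https://{DOMAIN}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=config,
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]
```

The same payload works with `curl -X POST -H "Authorization: Bearer $TOKEN" -d @config.json https://$DOMAIN/api/2.0/clusters/create`, so a JSON config per pipeline step is entirely feasible.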

Ideally, I'd like to achieve ADF pipeline deployment from code (JSON or Python), where I can configure the resources to be used at each step (on demand), and the package versions to be installed on each cluster.
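[Editor's note] One way to get per-step resources in ADF is a separate Databricks linked service per activity, each with its own new-job-cluster spec. A sketch of such a linked service definition (property names follow the ADF `AzureDatabricks` linked service schema; all values here are placeholders):

```json
{
  "name": "DatabricksPreprocessing",
  "properties": {
    "type": "AzureDatabricks",
    "typeProperties": {
      "domain": "https://adb-1234567890123456.7.azuredatabricks.net",
      "accessToken": { "type": "SecureString", "value": "<access token>" },
      "newClusterVersion": "10.4.x-scala2.12",
      "newClusterNodeType": "Standard_DS3_v2",
      "newClusterNumOfWorker": "40"
    }
  }
}
```

An evaluation step would point at a second linked service with `"newClusterNumOfWorker": "1"`; package versions can then be pinned per step via the `libraries` list on each Databricks Notebook activity.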

Thank you!

Giorgi


2 REPLIES

Hubert-Dudek
Esteemed Contributor III

I understand that, in your case, auto-scaling will take too much time.

The simplest option is to use a different cluster for each notebook (and make sure the previous cluster is terminated immediately).
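[Editor's note] To make sure the previous step's cluster stops billing as soon as its notebook finishes, it can be terminated explicitly. A minimal sketch, assuming a placeholder workspace domain and token (the endpoint name comes from the Clusters API 2.0; `delete` terminates the cluster without permanently removing it):

```python
import requests

# Hypothetical workspace values -- replace with your own.
DOMAIN = "adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"

def terminate_cluster(cluster_id):
    """Terminate (not permanently delete) a cluster via /api/2.0/clusters/delete,
    so the previous pipeline step's workers are released immediately."""
    resp = requests.post(
        f"https://{DOMAIN}/api/2.0/clusters/delete",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"cluster_id": cluster_id},
    )
    resp.raise_for_status()
```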

Another option is to use the REST API 2.0 endpoint /clusters/resize to resize the cluster: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/clusters#--resize

There is also a "magic" option to do it from within a notebook; the script below detects all the required parameters automatically.

import requests

# Pull the workspace domain, the current cluster's id, and an API token
# straight from the notebook context -- no hard-coded secrets needed.
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
domain_name = ctx.tags().get("browserHostName").get()
cluster_id = ctx.clusterId().get()
host_token = ctx.apiToken().get()

# Ask the Clusters API 2.0 to resize the current cluster to 2 workers.
requests.post(
    f'https://{domain_name}/api/2.0/clusters/resize',
    headers={'Authorization': f'Bearer {host_token}'},
    json={"cluster_id": cluster_id, "num_workers": 2},
)

