Data Governance
UC Enabled cluster for ADF ingestion

ossinova
Contributor II

I am migrating my Data Lake to use Unity Catalog. However, this comes with changes to the clusters. I have tried a few options, but it seems more complex than it should be.

I need to create a Unity Catalog-enabled cluster used by ADF that can install a JAR. From my testing, a shared cluster cannot use dbutils, which I need to pass parameters (e.g. a table name). It also does not allow libraries/JARs to be installed.
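For reference, parameters passed from an ADF Databricks Notebook activity (its base parameters) are typically read in the notebook with dbutils.widgets. A minimal sketch, assuming a parameter named `tablename`; the fallback branch only exists so the snippet runs outside Databricks, where `dbutils` is undefined:

```python
# Read the table name passed from ADF's Databricks Notebook activity
# (base parameters are exposed as notebook widgets on Databricks).
try:
    table_name = dbutils.widgets.get("tablename")  # defined on Databricks
except NameError:
    # Outside a Databricks session there is no dbutils; use a default.
    table_name = "default_table"

print(f"Processing table: {table_name}")
```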

A single-user interactive cluster seems like the right approach. However, I am not able to add the ADF service principal as a user.

A job cluster works. But I have many pipelines and Databricks notebook jobs that run daily, so it seems rather excessive to kickstart X clusters when one or two interactive clusters could be used.

What is the right approach here for creating a cluster for ADF that is UC-enabled, allows dbutils, and can have a JAR installed on it?

Is running more Job clusters more expensive than one interactive all-purpose one?

8 REPLIES

-werners-
Esteemed Contributor III

I exclusively use job clusters. They are cheaper.

Especially when you create a pool with spot instances. I'd go for that because that is what it's made for: batch jobs.

@Werner Stinckens​ But if you have X pipelines with X Databricks activities, won't that kickstart a lot of clusters? Basically using a lot more DBUs per hour (although these DBUs cost less than interactive ones).

-werners-
Esteemed Contributor III

that depends on how you configure the pipeline.

If you have e.g. 10 jobs and you run those 10 jobs in parallel, then 10 job clusters are created.

So you start paying for 10 clusters.

You could also process them sequentially. That way you only use 1 cluster at a time. However, you have to wait for each cluster to be provisioned, which is indeed wasted money. Hence I mentioned the cluster pool (warm nodes).

But you could also create e.g. 2 pipelines with 5 notebooks each, and if you use cluster pools you do not waste money/time waiting for nodes to be provisioned.
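As a sketch, such a warm pool with Azure spot instances could be defined through the Instance Pools API roughly like this (the pool name, node type, and sizes are placeholder assumptions):

```json
{
  "instance_pool_name": "adf-batch-pool",
  "node_type_id": "Standard_DS3_v2",
  "min_idle_instances": 2,
  "max_capacity": 10,
  "idle_instance_autotermination_minutes": 15,
  "azure_attributes": {
    "availability": "SPOT_AZURE",
    "spot_bid_max_price": -1
  }
}
```

A `spot_bid_max_price` of -1 caps the spot price at the on-demand rate, so instances are not evicted purely on price.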

Running jobs on interactive clusters is almost never cheaper. Remember that parallelism also means jobs are finished faster.

@Werner Stinckens​ I get your point. I would have to look more into it and potentially change my pipeline workflow. Per now, it is not optimized for job clusters, as it runs a separate pipeline for each table (in each stage: Bronze, Silver, Gold). So a lot of pipelines run in parallel, which would cause a batch of these job clusters to be spun up as opposed to a few interactive ones.

-werners-
Esteemed Contributor III

definitely worth looking into.

Mind that interactive clusters are more or less twice as expensive as job clusters in terms of DBUs.

So rather than having a layout like:

Silver/
- Silver_Pipeline_Table1
- Silver_Pipeline_Table2

Gold/
- Gold_Pipeline_Table1
- Gold_Pipeline_Table2

I should use something like:

Tables/
- Table1 (both silver and gold)
- Table2 (both silver and gold)

-werners-
Esteemed Contributor III

it depends on the dependencies.

if gold_table1 is dependent on only silver_table1 I'd do

pipeline1 = silver_table1 -> gold_table1 (sequential). Use a cluster pool so you can use warm workers for gold_table1

And in parallel you could do the same for table2 (or even run them all sequentially).
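As a sketch, that sequential silver -> gold shape could be expressed as one Databricks multi-task job (Jobs API 2.1), with both tasks sharing a pool-backed job cluster. The notebook paths, pool id, and runtime version here are placeholder assumptions:

```json
{
  "name": "pipeline1_table1",
  "job_clusters": [
    {
      "job_cluster_key": "pool_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "instance_pool_id": "<pool-id>",
        "num_workers": 2,
        "data_security_mode": "SINGLE_USER"
      }
    }
  ],
  "tasks": [
    {
      "task_key": "silver_table1",
      "job_cluster_key": "pool_cluster",
      "notebook_task": { "notebook_path": "/Pipelines/silver_table1" }
    },
    {
      "task_key": "gold_table1",
      "depends_on": [ { "task_key": "silver_table1" } ],
      "job_cluster_key": "pool_cluster",
      "notebook_task": { "notebook_path": "/Pipelines/gold_table1" }
    }
  ]
}
```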

you can also run multiple notebooks in parallel on the same cluster:

https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows#run-multiple-noteboo...
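The pattern from that doc boils down to launching `dbutils.notebook.run` calls from concurrent threads on one cluster. A minimal sketch, where `run_notebook` is a stand-in so the snippet runs outside Databricks (on a cluster you would call `dbutils.notebook.run` directly); the notebook path and `tablename` parameter are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def run_notebook(path, timeout_seconds, arguments):
    # On Databricks this would simply be:
    #   return dbutils.notebook.run(path, timeout_seconds, arguments)
    return f"ran {path} for {arguments['tablename']}"

tables = ["table1", "table2", "table3"]

# Each submitted call runs as a separate notebook job on the SAME cluster.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [
        pool.submit(run_notebook, "/Pipelines/silver_to_gold", 3600,
                    {"tablename": t})
        for t in tables
    ]
    # Collect results in submission order.
    results = [f.result() for f in futures]
```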

@Werner Stinckens​ Exactly. Our pipelines have a vast array of dependencies, which is why we keep them as separate pipelines, each waiting for an event signaling that its dependency pipelines have finished before it runs. Nonetheless, I will try job clusters out and compare the results to our current method. Thanks for the useful details.
