09-29-2022 02:09 AM
I am migrating my Data Lake to use Unity Catalog. However, this comes with changes to the clusters. I have tried a few options, but it seems more complex than it should be.
I need to create a Unity Catalog-enabled cluster for ADF that can have a JAR installed. From my testing, a shared cluster cannot use dbutils, which I need for passing parameters (e.g. a table name). It also does not allow libraries/JARs to be installed.
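For context, I pass those parameters from the ADF Notebook activity's baseParameters and read them as widgets inside the notebook, roughly like this (the parameter name is just an example):

```python
# Runs inside a Databricks notebook, where dbutils and spark are predefined globals.
# "tablename" must match a key in the ADF Databricks Notebook activity's baseParameters.
dbutils.widgets.text("tablename", "")         # declare with a default so interactive runs also work
tablename = dbutils.widgets.get("tablename")  # at run time this returns the value ADF passed in

df = spark.read.table(tablename)              # e.g. use the parameter to pick the source table
```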
A single user interactive cluster seems like the right approach. However, I am not able to add the ADF service principal as a user.
A job cluster works. But I have many pipelines and Databricks notebook jobs that run daily, so it seems excessive to spin up X clusters when one or two interactive clusters could be used.
What is the right approach here for creating a cluster for ADF that is UC enabled, allows dbutils and can have a JAR installed on it?
Is running more Job clusters more expensive than one interactive all-purpose one?
09-29-2022 03:12 AM
I exclusively use job clusters. They are cheaper.
Especially when you create a pool with spot instances. I'd go for that because that is what it's made for: batch jobs.
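If you want to try that, here is a rough sketch of creating such a pool through the Instance Pools API (the workspace URL, token, and node type are placeholders; the azure_attributes setting is what requests spot VMs on Azure):

```python
import requests

HOST = "https://<workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                 # in practice, read this from a secret store

pool_spec = {
    "instance_pool_name": "batch-jobs-pool",        # example name
    "node_type_id": "Standard_DS3_v2",              # pick a node type that fits your workload
    "min_idle_instances": 2,                        # warm nodes kept ready for incoming jobs
    "idle_instance_autotermination_minutes": 30,    # release idle nodes after this long
    "azure_attributes": {"availability": "SPOT_AZURE"},  # use Azure spot instances
}

resp = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=pool_spec,
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])  # reference this id from your job clusters
```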
09-29-2022 03:34 AM
@Werner Stinckens But if you have X pipelines with X Databricks activities, won't that spin up a lot of clusters? Basically using a lot more DBUs per hour (although those DBUs cost less than interactive ones).
09-29-2022 04:09 AM
That depends on how you configure the pipeline.
If you have e.g. 10 jobs and you run those 10 jobs in parallel, then 10 job clusters are created.
So you start paying for 10 clusters.
You could also process them sequentially; that way you only use one cluster at a time. However, you then have to wait for each cluster to be provisioned, which is indeed wasted money. Hence the cluster pool (warm nodes) I mentioned.
But you could also create e.g. 2 pipelines with 5 notebooks each, and if you use cluster pools you do not waste money/time waiting for nodes to be provisioned.
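To illustrate wiring a job to such a pool (a sketch only; the pool id, runtime version, and notebook path are placeholders), the job's new_cluster simply references instance_pool_id, so each run grabs warm nodes instead of provisioning cold ones:

```python
import requests

HOST = "https://<workspace>.azuredatabricks.net"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                 # in practice, read this from a secret store

job_spec = {
    "name": "silver_table1",
    "tasks": [{
        "task_key": "silver_table1",
        "notebook_task": {"notebook_path": "/Pipelines/silver_table1"},
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",         # pick a supported runtime
            "instance_pool_id": "<pool-id-from-above>",  # workers come from the warm pool
            "num_workers": 2,
        },
    }],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
```

ADF's Azure Databricks linked service can also reference an instance pool, so the same approach works for job clusters spawned from ADF activities.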
Running jobs on interactive clusters is almost never cheaper. Remember that parallelism also means jobs are finished faster.
09-29-2022 04:48 AM
@Werner Stinckens I get your point. I would have to look into it more and potentially change my pipeline workflow. As of now, it is not optimized for job clusters, as it runs a separate pipeline for each table (in each stage: Bronze, Silver, Gold). So a lot of pipelines run in parallel, which would spin up a batch of these job clusters as opposed to a few interactive ones.
09-29-2022 04:56 AM
Definitely worth looking into.
Mind that interactive (all-purpose) clusters are roughly twice as expensive as job clusters in terms of DBU rate.
09-29-2022 04:58 AM
So rather than having a layout like:
Silver/
-Silver_Pipeline_Table1
-Silver_Pipeline_Table2
Gold/
-Gold_Pipeline_Table1
-Gold_Pipeline_Table2
I should use something like:
Tables/
-Table1 (both silver and gold)
-Table2 (both silver and gold)
09-29-2022 05:04 AM
It depends on the dependencies.
If gold_table1 depends only on silver_table1, I'd do:
pipeline1 = silver_table1 -> gold_table1 (sequential). Use a cluster pool so you can use warm workers for gold_table1.
And in parallel you could do the same for table2 (or even run them all sequentially).
You can also run multiple notebooks in parallel on the same cluster:
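For example (notebook paths and parameters are placeholders), a driver notebook can fan out child runs with dbutils.notebook.run, and all of them share the driver's cluster:

```python
from concurrent.futures import ThreadPoolExecutor

# Runs inside a Databricks driver notebook, where dbutils is a predefined global.
tables = ["table1", "table2", "table3"]  # example table list

def run_silver_then_gold(table):
    # Each dbutils.notebook.run call executes the target notebook on this same cluster;
    # the arguments show up in the child notebook as widgets.
    dbutils.notebook.run("/Pipelines/silver", 3600, {"tablename": table})  # silver first
    dbutils.notebook.run("/Pipelines/gold", 3600, {"tablename": table})    # then its gold

# Each thread handles one table's silver -> gold chain, so independent tables
# run in parallel while dependent steps stay sequential.
with ThreadPoolExecutor(max_workers=3) as executor:
    list(executor.map(run_silver_then_gold, tables))
```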
09-29-2022 06:44 AM
@Werner Stinckens Exactly. Our pipelines have a vast array of dependencies, which is why we run them as separate pipelines, each basically waiting for an event saying its dependency pipelines have finished before it runs. Nonetheless, I will try job clusters out and compare the results with our current method. Thanks for the useful details.