<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: UC Enabled cluster for ADF ingestion in Data Governance</title>
    <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30257#M880</link>
    <description>&lt;P&gt;Definitely worth looking into.&lt;/P&gt;&lt;P&gt;Mind that interactive clusters are more or less twice as expensive as job clusters in terms of DBUs.&lt;/P&gt;</description>
    <pubDate>Thu, 29 Sep 2022 11:56:02 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-09-29T11:56:02Z</dc:date>
    <item>
      <title>UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30252#M875</link>
      <description>&lt;P&gt;I am migrating my Data Lake to use Unity Catalog. However, this comes with changes to the clusters. I have tried a few options, but it seems more complex than it should be.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I need to create a Unity Catalog-enabled cluster for ADF that can install a JAR. From my testing, a shared cluster cannot use dbutils, which I need for passing parameters (e.g. the table name). It also does not allow libraries / JARs to be installed.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A single-user interactive cluster seems like the right approach. However, I am not able to add the ADF service principal as a user.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A job cluster works. But I have many pipelines and Databricks notebook jobs that run daily, so it seems rather excessive to kickstart X clusters when one or two interactive clusters could be used.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What is the right approach for creating a UC-enabled cluster for ADF that allows dbutils and can have a JAR installed on it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is running many job clusters more expensive than one interactive all-purpose cluster?&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 09:09:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30252#M875</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T09:09:33Z</dc:date>
    </item>
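The parameter-passing problem raised above (ADF handing a table name to a notebook) is typically solved with notebook widgets read via `dbutils.widgets`. A minimal sketch, assuming the ADF Notebook activity passes a base parameter named `tablename`; note that `dbutils` exists only inside a Databricks runtime, so a local stub stands in for it here purely for illustration:

```python
# Sketch of reading an ADF-supplied parameter in a Databricks notebook.
# NOTE: `dbutils` is provided by the Databricks runtime; this stub mimics
# the widgets API only so the pattern can be shown (and run) locally.

class _WidgetsStub:
    def __init__(self):
        self._values = {}

    def text(self, name, default, label=None):
        # On Databricks, an ADF baseParameter with the same name
        # overrides the declared default.
        self._values.setdefault(name, default)

    def get(self, name):
        return self._values[name]

class _DbutilsStub:
    widgets = _WidgetsStub()

dbutils = _DbutilsStub()  # on a real cluster this object already exists

# Notebook code as it would appear on Databricks:
dbutils.widgets.text("tablename", "default_table")  # declare the parameter
table_name = dbutils.widgets.get("tablename")       # read what ADF passed in

print(f"Ingesting table: {table_name}")
```

On a real cluster the two `dbutils.widgets` lines are all that is needed; ADF's Notebook activity supplies the value through its `baseParameters` map.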
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30253#M876</link>
      <description>&lt;P&gt;I exclusively use job clusters. They are cheaper.&lt;/P&gt;&lt;P&gt;Especially when you create a pool with spot instances. I'd go for that, because that is what it's made for: batch jobs.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 10:12:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30253#M876</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T10:12:22Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30254#M877</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;But if you have X pipelines with X Databricks activities, won't that kickstart a lot of clusters? Basically using many more DBUs per hour (although those DBUs cost less than interactive ones).&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 10:34:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30254#M877</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T10:34:22Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30255#M878</link>
      <description>&lt;P&gt;That depends on how you configure the pipeline.&lt;/P&gt;&lt;P&gt;If you have e.g. 10 jobs and you run those 10 jobs in parallel, then 10 job clusters are created.&lt;/P&gt;&lt;P&gt;So you start paying for 10 clusters.&lt;/P&gt;&lt;P&gt;You could also process them sequentially. That way you only use one cluster at a time; however, you have to wait for each cluster to be provisioned, which is indeed wasted money. Hence I mentioned the cluster pool (warm nodes).&lt;/P&gt;&lt;P&gt;But you could also create e.g. 2 pipelines with 5 notebooks each, and if you use cluster pools you do not waste money/time waiting for nodes to be provisioned.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Running jobs on interactive clusters is almost never cheaper. Remember that parallelism also means jobs finish faster.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:09:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30255#M878</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T11:09:09Z</dc:date>
    </item>
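The cluster-pool suggestion above maps to the `instance_pool_id` field of a job cluster specification. A hedged sketch of the relevant fragment of a Databricks Jobs API `new_cluster` block (the pool id and runtime version are placeholders); `data_security_mode` set to `SINGLE_USER` keeps the job cluster Unity Catalog-capable:

```json
{
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "instance_pool_id": "1234-567890-pool123",
    "num_workers": 4,
    "data_security_mode": "SINGLE_USER"
  }
}
```

With this fragment, each job run draws warm nodes from the pool instead of provisioning fresh VMs, which is the money/time saving discussed above.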
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30256#M879</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;I get your point. I would have to look more into it and potentially change my pipeline workflow. As of now, it is not optimized for job clusters, as it runs a separate pipeline for each table (in each stage: Bronze, Silver, Gold). So a lot of pipelines run in parallel, which would cause a batch of these job clusters to spin up as opposed to a few interactive ones.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:48:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30256#M879</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T11:48:20Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30257#M880</link>
      <description>&lt;P&gt;Definitely worth looking into.&lt;/P&gt;&lt;P&gt;Mind that interactive clusters are more or less twice as expensive as job clusters in terms of DBUs.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:56:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30257#M880</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T11:56:02Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30258#M881</link>
      <description>&lt;P&gt;So rather than having a layout like:&lt;/P&gt;&lt;P&gt;Silver/&lt;/P&gt;&lt;P&gt;-Silver_Pipeline_Table1&lt;/P&gt;&lt;P&gt;-Silver_Pipeline_Table2&lt;/P&gt;&lt;P&gt;Gold/&lt;/P&gt;&lt;P&gt;-Gold_Pipeline_Table1&lt;/P&gt;&lt;P&gt;-Gold_Pipeline_Table2&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I should use something like:&lt;/P&gt;&lt;P&gt;Tables/&lt;/P&gt;&lt;P&gt;-Table1 (both silver and gold)&lt;/P&gt;&lt;P&gt;-Table2 (both silver and gold)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:58:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30258#M881</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T11:58:29Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30259#M882</link>
      <description>&lt;P&gt;It depends on the dependencies.&lt;/P&gt;&lt;P&gt;If gold_table1 depends only on silver_table1, I'd do&lt;/P&gt;&lt;P&gt;pipeline1 = silver_table1 -&amp;gt; gold_table1 (sequential). Use a cluster pool so you can use warm workers for gold_table1.&lt;/P&gt;&lt;P&gt;And in parallel you could do the same for table2 (or even run them all sequentially).&lt;/P&gt;&lt;P&gt;You can also run multiple notebooks in parallel on the same cluster:&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows#run-multiple-notebooks-concurrently" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows#run-multiple-notebooks-concurrently&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 12:04:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30259#M882</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T12:04:15Z</dc:date>
    </item>
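The "multiple notebooks in parallel on the same cluster" pattern linked above is usually implemented with a thread pool around `dbutils.notebook.run`. A sketch of that control flow; `run_notebook` here is a hypothetical stand-in for `dbutils.notebook.run` (same argument shape) so the pattern can run outside a Databricks runtime, and the notebook path is a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def run_notebook(path, timeout_seconds, parameters):
    # Stand-in for dbutils.notebook.run(path, timeout_seconds, parameters),
    # which on Databricks launches the notebook on the same cluster and
    # blocks until it returns.
    return f"done: {path} ({parameters['tablename']})"

tables = ["table1", "table2", "table3"]

# Fan the per-table notebook runs out over a small thread pool; each thread
# blocks on its own notebook run, so the cluster processes several tables
# concurrently on a single set of workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(run_notebook, "/Ingest/silver_to_gold", 3600,
                    {"tablename": t})
        for t in tables
    ]
    results = [f.result() for f in futures]

print(results)
```

Because the futures are collected in submission order, the results line up with the input table list even though execution order may interleave.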
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30260#M883</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;Exactly. Our pipelines have a vast array of dependencies, which is why we have them as separate pipelines that basically wait for an event saying that their dependency pipelines have finished -&amp;gt; run new pipeline. Nonetheless, I will try job clusters out and compare the results to our current method. Thanks for the useful details.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 13:44:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30260#M883</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T13:44:29Z</dc:date>
    </item>
  </channel>
</rss>

