<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: UC Enabled cluster for ADF ingestion in Data Governance</title>
    <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30257#M880</link>
    <description>&lt;P&gt;Definitely worth looking into.&lt;/P&gt;&lt;P&gt;Mind that interactive clusters are more or less twice as expensive as job clusters in terms of DBUs.&lt;/P&gt;</description>
    <pubDate>Thu, 29 Sep 2022 11:56:02 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-09-29T11:56:02Z</dc:date>
    <item>
      <title>UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30252#M875</link>
      <description>&lt;P&gt;I am migrating my Data Lake to use Unity Catalog. However, this comes with changes to the clusters. I have tried a few options, but it seems more complex than it should be.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I need to create a Unity Catalog-enabled cluster for ADF that can install a JAR. From my testing, a shared cluster cannot use dbutils, which I need for passing parameters (e.g. the table name). It also does not allow libraries / JARs to be installed.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A single-user interactive cluster seems like the right approach. However, I am not able to add the ADF service principal as a user.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A job cluster works. But I have many pipelines and Databricks notebook jobs that run daily, so it seems rather excessive to kickstart X clusters when one or two interactive clusters could be used.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What is the right approach for creating a UC-enabled cluster for ADF that allows dbutils and can have a JAR installed on it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is running many job clusters more expensive than one interactive all-purpose cluster?&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 09:09:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30252#M875</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T09:09:33Z</dc:date>
    </item>
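The parameter-passing problem raised above (ADF handing a table name to a notebook) is typically solved with notebook widgets read via `dbutils.widgets`. A minimal sketch, assuming the ADF Notebook activity passes a base parameter named `tablename`; note that `dbutils` exists only inside a Databricks runtime, so a local stub stands in for it here purely for illustration:

```python
# Sketch of reading an ADF-supplied parameter in a Databricks notebook.
# NOTE: `dbutils` is provided by the Databricks runtime; this stub mimics
# the widgets API only so the pattern can be shown (and run) locally.

class _WidgetsStub:
    def __init__(self):
        self._values = {}

    def text(self, name, default, label=None):
        # On Databricks, an ADF baseParameter with the same name
        # overrides the declared default.
        self._values.setdefault(name, default)

    def get(self, name):
        return self._values[name]

class _DbutilsStub:
    widgets = _WidgetsStub()

dbutils = _DbutilsStub()  # on a real cluster this object already exists

# Notebook code as it would appear on Databricks:
dbutils.widgets.text("tablename", "default_table")  # declare the parameter
table_name = dbutils.widgets.get("tablename")       # read what ADF passed in

print(f"Ingesting table: {table_name}")
```

On a real cluster the two `dbutils.widgets` lines are all that is needed; ADF's Notebook activity supplies the value through its `baseParameters` map.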
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30253#M876</link>
      <description>&lt;P&gt;I exclusively use job clusters. They are cheaper.&lt;/P&gt;&lt;P&gt;Especially when you create a pool with spot instances. I'd go for that, because that is what it's made for: batch jobs.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 10:12:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30253#M876</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T10:12:22Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30254#M877</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;But if you have X pipelines with X Databricks activities, won't that kickstart a lot of clusters? Basically using many more DBUs per hour (although those DBUs cost less than interactive ones).&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 10:34:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30254#M877</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T10:34:22Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30255#M878</link>
      <description>&lt;P&gt;That depends on how you configure the pipeline.&lt;/P&gt;&lt;P&gt;If you have e.g. 10 jobs and you run those 10 jobs in parallel, then 10 job clusters are created.&lt;/P&gt;&lt;P&gt;So you start paying for 10 clusters.&lt;/P&gt;&lt;P&gt;You could also process them sequentially. That way you only use one cluster at a time; however, you have to wait for each cluster to be provisioned, which is indeed wasted money. Hence I mentioned the cluster pool (warm nodes).&lt;/P&gt;&lt;P&gt;But you could also create e.g. 2 pipelines with 5 notebooks each, and if you use cluster pools you do not waste money/time waiting for nodes to be provisioned.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Running jobs on interactive clusters is almost never cheaper. Remember that parallelism also means jobs finish faster.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:09:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30255#M878</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T11:09:09Z</dc:date>
    </item>
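The cluster-pool suggestion above maps to the `instance_pool_id` field of a job cluster specification. A hedged sketch of the relevant fragment of a Databricks Jobs API `new_cluster` block (the pool id and runtime version are placeholders); `data_security_mode` set to `SINGLE_USER` keeps the job cluster Unity Catalog-capable:

```json
{
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "instance_pool_id": "1234-567890-pool123",
    "num_workers": 4,
    "data_security_mode": "SINGLE_USER"
  }
}
```

With this fragment, each job run draws warm nodes from the pool instead of provisioning fresh VMs, which is the money/time saving discussed above.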
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30256#M879</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;I get your point. I would have to look more into it and potentially change my pipeline workflow. As of now, it is not optimized for job clusters, as it runs a separate pipeline for each table (in each stage: Bronze, Silver, Gold). So a lot of pipelines run in parallel, which would cause a batch of these job clusters to spin up as opposed to a few interactive ones.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:48:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30256#M879</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T11:48:20Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30257#M880</link>
      <description>&lt;P&gt;Definitely worth looking into.&lt;/P&gt;&lt;P&gt;Mind that interactive clusters are more or less twice as expensive as job clusters in terms of DBUs.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:56:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30257#M880</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T11:56:02Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30258#M881</link>
      <description>&lt;P&gt;So rather than having a layout like:&lt;/P&gt;&lt;P&gt;Silver/&lt;/P&gt;&lt;P&gt;-Silver_Pipeline_Table1&lt;/P&gt;&lt;P&gt;-Silver_Pipeline_Table2&lt;/P&gt;&lt;P&gt;Gold/&lt;/P&gt;&lt;P&gt;-Gold_Pipeline_Table1&lt;/P&gt;&lt;P&gt;-Gold_Pipeline_Table2&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I should use something like:&lt;/P&gt;&lt;P&gt;Tables/&lt;/P&gt;&lt;P&gt;-Table1 (both silver and gold)&lt;/P&gt;&lt;P&gt;-Table2 (both silver and gold)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 11:58:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30258#M881</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T11:58:29Z</dc:date>
    </item>
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30259#M882</link>
      <description>&lt;P&gt;It depends on the dependencies.&lt;/P&gt;&lt;P&gt;If gold_table1 depends only on silver_table1, I'd do&lt;/P&gt;&lt;P&gt;pipeline1 = silver_table1 -&amp;gt; gold_table1 (sequential). Use a cluster pool so you can use warm workers for gold_table1.&lt;/P&gt;&lt;P&gt;And in parallel you could do the same for table2 (or even run them all sequentially).&lt;/P&gt;&lt;P&gt;You can also run multiple notebooks in parallel on the same cluster:&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows#run-multiple-notebooks-concurrently" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/notebooks/notebook-workflows#run-multiple-notebooks-concurrently&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 12:04:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30259#M882</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-29T12:04:15Z</dc:date>
    </item>
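The "multiple notebooks in parallel on the same cluster" pattern linked above is usually implemented with a thread pool around `dbutils.notebook.run`. A sketch of that control flow; `run_notebook` here is a hypothetical stand-in for `dbutils.notebook.run` (same argument shape) so the pattern can run outside a Databricks runtime, and the notebook path is a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def run_notebook(path, timeout_seconds, parameters):
    # Stand-in for dbutils.notebook.run(path, timeout_seconds, parameters),
    # which on Databricks launches the notebook on the same cluster and
    # blocks until it returns.
    return f"done: {path} ({parameters['tablename']})"

tables = ["table1", "table2", "table3"]

# Fan the per-table notebook runs out over a small thread pool; each thread
# blocks on its own notebook run, so the cluster processes several tables
# concurrently on a single set of workers.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(run_notebook, "/Ingest/silver_to_gold", 3600,
                    {"tablename": t})
        for t in tables
    ]
    results = [f.result() for f in futures]

print(results)
```

Because the futures are collected in submission order, the results line up with the input table list even though execution order may interleave.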
    <item>
      <title>Re: UC Enabled cluster for ADF ingestion</title>
      <link>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30260#M883</link>
      <description>&lt;P&gt;@Werner Stinckens&amp;nbsp;Exactly. Our pipelines have a vast array of dependencies, which is why we have them as separate pipelines that basically wait for an event saying that their dependency pipelines have finished -&amp;gt; run new pipeline. Nonetheless, I will try job clusters out and compare the results to our current method. Thanks for the useful details.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 13:44:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-governance/uc-enabled-cluster-for-adf-ingestion/m-p/30260#M883</guid>
      <dc:creator>ossinova</dc:creator>
      <dc:date>2022-09-29T13:44:29Z</dc:date>
    </item>
  </channel>
</rss>

