<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Cost of individual jobs running on a shared Databricks cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31975#M23315</link>
    <description>&lt;P&gt;Cost needs to be calculated after job completion, and it has to be calculated programmatically (which I forgot to mention earlier).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm calculating the cost as follows:&lt;/P&gt;&lt;P&gt;Cost = (EC2 instance cost per hour per core * job's &lt;B&gt;vcore-hours&lt;/B&gt;) + EMR fee&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Therefore, I need an API that can report a job's resource usage in terms of &lt;B&gt;vcore-seconds/hours&lt;/B&gt; on a Databricks cluster, the way YARN does through the Resource Manager API.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is there any such API available, or is there any other way to calculate the cost?&lt;/P&gt;&lt;P&gt;And if there is an API that can report the cost of each job directly, that would be even better.&lt;/P&gt;</description>
    <pubDate>Wed, 12 Jan 2022 04:10:24 GMT</pubDate>
    <dc:creator>Ashish</dc:creator>
    <dc:date>2022-01-12T04:10:24Z</dc:date>
    <item>
      <title>Cost of individual jobs running on a shared Databricks cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31971#M23311</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am working on a requirement where I need to calculate the &lt;B&gt;cost of each Spark job individually&lt;/B&gt; on a shared Azure/AWS Databricks cluster. There can be multiple jobs running on the cluster in parallel.&lt;/P&gt;&lt;P&gt;Cost needs to be calculated after job completion, and it has to be calculated programmatically.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So, I'm looking for an API:&lt;/P&gt;&lt;P&gt;- one that can fetch the cost of each individual job directly (that would be the ideal solution),&lt;/P&gt;&lt;P&gt;or&lt;/P&gt;&lt;P&gt;- one that can report a job's resource usage in terms of &lt;B&gt;vcore-seconds&lt;/B&gt; on a Databricks cluster, as YARN makes available through the &lt;B&gt;Resource Manager API&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;On a YARN cluster, I'm using the following approach to calculate the cost of each Spark application individually:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Cost = (EC2 instance cost per hour per core * application's &lt;B&gt;vcore-hours&lt;/B&gt;) + EMR fee&lt;/P&gt;&lt;P&gt;In YARN, the Resource Manager provides the resource usage of each Spark application in terms of&amp;nbsp;&lt;B&gt;memory-seconds&lt;/B&gt;&amp;nbsp;and&amp;nbsp;&lt;B&gt;vcore-seconds&lt;/B&gt;.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jan 2022 08:46:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31971#M23311</guid>
      <dc:creator>Ashish</dc:creator>
      <dc:date>2022-01-10T08:46:47Z</dc:date>
    </item>
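    <!-- Editor's note: a minimal sketch, in Python, of the cost formula from the post
    above. The rate and fee values below are illustrative placeholders, not figures
    taken from the thread.

```python
# Sketch of: Cost = (EC2 cost per core-hour * job's vcore-hours) + EMR fee.
# All numeric values here are hypothetical, for illustration only.

def job_cost(vcore_seconds: float,
             ec2_cost_per_core_hour: float,
             emr_fee: float) -> float:
    """Convert vcore-seconds to vcore-hours and apply the formula."""
    vcore_hours = vcore_seconds / 3600.0
    return ec2_cost_per_core_hour * vcore_hours + emr_fee

# 7200 vcore-seconds = 2 vcore-hours, at a hypothetical $0.05/core-hour
# plus a hypothetical $0.10 EMR fee.
cost = job_cost(7200, 0.05, 0.10)
```
    -->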
    <item>
      <title>Re: Cost of individual jobs running on a shared Databricks cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31972#M23312</link>
      <description>&lt;P&gt;Do you have to calculate this beforehand? Because that is pretty hard to predict, especially if you use autoscaling.&lt;/P&gt;&lt;P&gt;Besides the hardware provisioning cost, you also have the DBUs, and these depend on the type of VM used, whether it is a job cluster or an interactive cluster, etc.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now, on Azure an excellent cost breakdown is possible, and I suppose this is also possible on AWS. But that is of course post hoc, so your jobs have already run.&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jan 2022 09:17:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31972#M23312</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-01-10T09:17:22Z</dc:date>
    </item>
    <item>
      <title>Re: Cost of individual jobs running on a shared Databricks cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31973#M23313</link>
      <description>&lt;P&gt;There is built-in functionality for getting the costs:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;AWS - &lt;A href="https://docs.databricks.com/administration-guide/account-settings-e2/usage.html" target="_blank"&gt;https://docs.databricks.com/administration-guide/account-settings-e2/usage.html&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Azure - via the built-in Cost Management + Billing&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The main problem with that functionality is that the smallest granularity you get is cluster/job, because it relies on the cluster/job tags. There are also problems with compute costs when nodes are obtained from instance pools, as those nodes aren't re-tagged when used.&lt;/P&gt;&lt;P&gt;These problems could be solved by using &lt;A href="https://github.com/databrickslabs/overwatch" alt="https://github.com/databrickslabs/overwatch" target="_blank"&gt;project Overwatch&lt;/A&gt;, which makes it possible to get more granular data, like costs per notebook/user/...&lt;/P&gt;</description>
      <pubDate>Mon, 10 Jan 2022 10:14:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31973#M23313</guid>
      <dc:creator>alexott</dc:creator>
      <dc:date>2022-01-10T10:14:36Z</dc:date>
    </item>
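    <!-- Editor's note: a minimal sketch of aggregating usage per cluster from a
    downloaded billable-usage export, in the spirit of the answer above. The
    column names used here ("clusterId", "dbus") and the sample rows are
    assumptions for illustration, not taken from the thread or verified against
    the export schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a billable-usage CSV export; columns assumed.
SAMPLE = """clusterId,dbus
cluster-a,1.5
cluster-b,0.75
cluster-a,2.0
"""

def dbus_per_cluster(csv_text: str) -> dict:
    """Sum the DBUs recorded for each cluster id in the export."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["clusterId"]] += float(row["dbus"])
    return dict(totals)

totals = dbus_per_cluster(SAMPLE)
```
    -->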
    <item>
      <title>Re: Cost of individual jobs running on a shared Databricks cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31975#M23315</link>
      <description>&lt;P&gt;Cost needs to be calculated after job completion, and it has to be calculated programmatically (which I forgot to mention earlier).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm calculating the cost as follows:&lt;/P&gt;&lt;P&gt;Cost = (EC2 instance cost per hour per core * job's &lt;B&gt;vcore-hours&lt;/B&gt;) + EMR fee&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Therefore, I need an API that can report a job's resource usage in terms of &lt;B&gt;vcore-seconds/hours&lt;/B&gt; on a Databricks cluster, the way YARN does through the Resource Manager API.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is there any such API available, or is there any other way to calculate the cost?&lt;/P&gt;&lt;P&gt;And if there is an API that can report the cost of each job directly, that would be even better.&lt;/P&gt;</description>
      <pubDate>Wed, 12 Jan 2022 04:10:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31975#M23315</guid>
      <dc:creator>Ashish</dc:creator>
      <dc:date>2022-01-12T04:10:24Z</dc:date>
    </item>
    <item>
      <title>Re: Cost of individual jobs running on a shared Databricks cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31976#M23316</link>
      <description>&lt;P&gt;No, it's not answered yet.&lt;/P&gt;</description>
      <pubDate>Sat, 15 Jan 2022 09:56:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-of-individual-jobs-running-on-a-shared-databricks-cluster/m-p/31976#M23316</guid>
      <dc:creator>Ashish</dc:creator>
      <dc:date>2022-01-15T09:56:18Z</dc:date>
    </item>
  </channel>
</rss>

