Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cost of individual jobs running on a shared Databricks cluster

Ashish
New Contributor II

Hi All,

I am working on a requirement where I need to calculate the cost of each Spark job individually on a shared Azure/AWS Databricks cluster. There can be multiple jobs running on the cluster in parallel.

Cost needs to be calculated after job completion, and it has to be calculated programmatically.

So, I'm looking for an API:

- which can fetch the cost of each individual job directly (that would be the ideal solution),

or

- which can give the resource usage of a job in terms of vcore-seconds on a Databricks cluster, the way YARN exposes it through the Resource Manager API.

On a YARN cluster, I use the following approach to calculate the cost of each Spark application individually:

Cost = (EC2 instance cost per hour per core * application's vcore-hours) + EMR fee

In YARN, the Resource Manager provides the resource usage of each Spark application in terms of memory-seconds and vcore-seconds.
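
For reference, a minimal sketch of that EMR/YARN calculation, assuming the application has finished and the ResourceManager REST endpoint is reachable (the host, application id, and prices below are placeholders):

```python
import requests

# Placeholders -- substitute your ResourceManager host, application id, and prices.
RM_URL = "http://resource-manager-host:8088"
APP_ID = "application_1700000000000_0042"
EC2_PRICE_PER_CORE_HOUR = 0.05   # assumed EC2 cost per core-hour for the instance type
EMR_FEE = 0.10                   # assumed EMR surcharge for this run

# The ResourceManager REST API reports aggregate resource usage for an application,
# including vcoreSeconds and memorySeconds.
app = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{APP_ID}").json()["app"]
vcore_hours = app["vcoreSeconds"] / 3600.0

cost = EC2_PRICE_PER_CORE_HOUR * vcore_hours + EMR_FEE
print(f"vcore-hours: {vcore_hours:.2f}, estimated cost: ${cost:.2f}")
```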

1 ACCEPTED SOLUTION

alexott
Databricks Employee

There is built-in functionality for getting the costs.

The main problem with that functionality is that the smallest granularity you get is the cluster/job level, because it relies on the cluster/job tags. There are also problems with compute costs when nodes are obtained from instance pools, because those nodes aren't re-tagged when they are used.

These problems can be solved with the Overwatch project, which allows you to get more granular data, such as costs per notebook, per user, etc.
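
For illustration, a minimal sketch of grouping an exported cloud bill by those tags (the file name and column names are assumptions about a hypothetical cost export; adjust them to your export's schema):

```python
import pandas as pd

# Hypothetical cost export (e.g. an Azure Cost Management or AWS CUR extract).
costs = pd.read_csv("cost_export.csv")

# Databricks propagates cluster/job tags (e.g. ClusterId, JobId, RunName) to the
# underlying compute resources, so the bill can be grouped by those tags.
job_costs = (
    costs[costs["tag_JobId"].notna()]     # rows attributed to job clusters
    .groupby("tag_JobId")["cost_usd"]
    .sum()
    .sort_values(ascending=False)
)
print(job_costs)
```

As noted above, this only gets you to cluster/job granularity; anything finer (per notebook, per user) needs something like Overwatch.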

4 REPLIES

-werners-
Esteemed Contributor III

Do you have to calculate this beforehand? Because that is pretty hard to predict, especially if you use autoscaling.

Besides the hardware provisioning cost, you also have the DBUs.

The cost also depends on the type of VMs used, and whether it is a job cluster or an interactive cluster, etc.

On Azure an excellent cost breakdown is possible; I suppose this is also possible on AWS.

But that is of course post hoc, so your jobs have already run.
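
For a rough sense of those two components, an illustrative back-of-the-envelope calculation (every rate below is a placeholder; use the real VM price and DBU rate for your instance type, cloud, and workload tier):

```python
# All numbers are placeholders, not real prices.
num_nodes = 4              # driver + workers (ignoring autoscaling changes)
runtime_hours = 2.0
vm_price_per_hour = 0.50   # assumed cloud VM price per node-hour
dbus_per_node_hour = 1.5   # assumed DBU consumption per node-hour
dbu_price = 0.30           # assumed price per DBU

vm_cost = num_nodes * runtime_hours * vm_price_per_hour
dbu_cost = num_nodes * runtime_hours * dbus_per_node_hour * dbu_price
print(f"VM cost: ${vm_cost:.2f}, DBU cost: ${dbu_cost:.2f}, total: ${vm_cost + dbu_cost:.2f}")
```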

Ashish
New Contributor II

Cost needs to be calculated after job completion, and it has to be calculated programmatically (which I forgot to mention earlier).

And I'm calculating the cost as follows:

Cost = (EC2 instance cost per hour per core * job's vcore-hours) + EMR fee

Therefore, I need an API which can give the resource usage of a job in terms of vcore-seconds or vcore-hours on a Databricks cluster, the way it is available in YARN through the Resource Manager API.

Is there any API available, or is there any other way to calculate the cost?

And if there is an API which can give the cost of each job directly, that would be great.
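
To make the ask concrete, here is a hedged sketch of the kind of per-job usage data I'm after, approximated from the Spark monitoring REST API rather than any Databricks cost API (it assumes the Spark UI endpoint is reachable from a notebook attached to the shared cluster):

```python
import requests
from pyspark.sql import SparkSession

# Notebooks attached to a shared cluster typically run inside one Spark application,
# so its UI lists the Spark jobs submitted by everyone on that cluster.
spark = SparkSession.builder.getOrCreate()
ui_url = spark.sparkContext.uiWebUrl           # Spark UI base URL on the driver
app_id = spark.sparkContext.applicationId

# executorRunTime is the summed task run time per stage in milliseconds; with one
# core per task this roughly corresponds to core-milliseconds.
stages = requests.get(f"{ui_url}/api/v1/applications/{app_id}/stages").json()
stage_runtime_ms = {s["stageId"]: s.get("executorRunTime", 0) for s in stages}

jobs = requests.get(f"{ui_url}/api/v1/applications/{app_id}/jobs").json()
for job in jobs:
    core_seconds = sum(stage_runtime_ms.get(sid, 0) for sid in job["stageIds"]) / 1000.0
    print(job["jobId"], job.get("name", ""), f"~{core_seconds:.1f} core-seconds")
```

This only works while the shared cluster (and hence its Spark application) is still up, so the numbers would have to be collected before the cluster terminates.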

Ashish
New Contributor II

No, it's not answered yet.
