
Cost of individual jobs running on a shared Databricks cluster

Ashish
New Contributor II

Hi All,

I am working on a requirement where I need to calculate the cost of each Spark job individually on a shared Azure/AWS Databricks cluster. There can be multiple jobs running on the cluster in parallel.

The cost needs to be calculated after job completion, and it has to be calculated programmatically.

So, I'm looking for an API:

- which can fetch the cost of each individual job directly (that would be the ideal solution), or

- which can give the resource usage of a job in terms of vcore-seconds on a Databricks cluster, as available in YARN through the ResourceManager API.

In a YARN cluster, I'm using the following approach to calculate the cost of each Spark application individually:

Cost = (EC2 instance cost per hour per core * application's vcore-hours) + EMR fee

In YARN, the ResourceManager provides the resource usage of each Spark application in terms of memory-seconds and vcore-seconds.
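
For reference, this is roughly how I compute it from the ResourceManager REST API today (a minimal sketch; the ResourceManager host and the per-core rates below are placeholders for my environment):

import requests

# Placeholder values for illustration only; substitute your own host and rates.
RM_URL = "http://<resourcemanager-host>:8088"
EC2_COST_PER_CORE_HOUR = 0.0175   # instance hourly price divided by its vCPU count
EMR_FEE_PER_CORE_HOUR = 0.0044    # EMR uplift, prorated per core-hour

def yarn_app_cost(app_id: str) -> float:
    """Cost of one Spark application, derived from YARN's vcoreSeconds counter."""
    resp = requests.get(f"{RM_URL}/ws/v1/cluster/apps/{app_id}")
    resp.raise_for_status()
    vcore_hours = resp.json()["app"]["vcoreSeconds"] / 3600.0
    return vcore_hours * (EC2_COST_PER_CORE_HOUR + EMR_FEE_PER_CORE_HOUR)

print(yarn_app_cost("application_1700000000000_0042"))

What I'm missing on Databricks is the equivalent of that per-application vcoreSeconds counter.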

5 REPLIES

-werners-
Esteemed Contributor III

Do you have to calculate this beforehand? Because that is pretty hard to predict.

Especially if you use autoscaling.

Besides hardware provisioning cost, you also have the DBUs.

And the cost depends on the type of VM used, and on whether it is a job cluster or an interactive cluster, etc.

Now, on Azure an excellent cost breakdown is possible; I suppose this is also possible on AWS.

But that is of course post hoc, so your jobs have already run.
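
Just to illustrate how the DBUs come on top of the VM cost, here is a minimal back-of-the-envelope sketch; all rates below are made-up placeholders, and the real DBU rate depends on cloud, pricing tier, and whether it is a job or interactive cluster:

# All rates are illustrative placeholders, not real prices.
VM_COST_PER_HOUR = 0.50        # hourly price of one worker VM
DBU_PER_VM_PER_HOUR = 1.5      # DBU consumption rate of that VM type
DBU_PRICE = 0.30               # price per DBU (depends on tier and workload type)

def cluster_cost(num_vms: int, hours: float) -> float:
    """Rough cluster cost: VM infrastructure plus DBUs, ignoring autoscaling."""
    infra = num_vms * hours * VM_COST_PER_HOUR
    dbus = num_vms * hours * DBU_PER_VM_PER_HOUR * DBU_PRICE
    return infra + dbus

print(cluster_cost(num_vms=4, hours=2.5))   # e.g. 4 workers for 2.5 hours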

Ashish
New Contributor II

The cost needs to be calculated after job completion, and it has to be calculated programmatically (which I forgot to mention).

And I'm calculating the cost as follows:

Cost = (EC2 instance cost per hour per core * job's vcore-hours) + EMR fee

Therefore, I need an API that can give a job's resource usage in terms of vcore-seconds/hours on a Databricks cluster, as YARN makes available through the ResourceManager API.

Is there any API available, or is there any other way to calculate the cost?

And if there is an API that can give the cost of each job directly, that would be great.

alexott
Valued Contributor II

There is built-in functionality for getting the costs.

The main problem with that functionality is that the smallest granularity you get is cluster/job, because it relies on the cluster/job tags. There are also problems with compute costs when nodes are obtained from instance pools, as the nodes aren't re-tagged when used.
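
As a sketch of how the tag-based approach works: you can read a cluster's tags through the Clusters API and then join them against the tag columns of your cloud cost export. The workspace URL, token, and cluster id below are placeholders:

import requests

# Placeholders: substitute your workspace URL and a valid personal access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

def cluster_tags(cluster_id: str) -> dict:
    """Return default + custom tags of a cluster (e.g. ClusterId, and JobId on job clusters)."""
    resp = requests.get(
        f"{HOST}/api/2.0/clusters/get",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"cluster_id": cluster_id},
    )
    resp.raise_for_status()
    info = resp.json()
    return {**info.get("default_tags", {}), **info.get("custom_tags", {})}

# These tags can then be matched against the tag columns of the
# Azure Cost Management / AWS Cost and Usage Report export.
print(cluster_tags("0123-456789-abcde123"))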

These problems could be solved by using the Overwatch project, which allows you to get more granular data, such as costs per notebook/user/...

Kaniz
Community Manager

Hi @Ashish Kardam, do @werners's or @Alex Ott's replies answer your question?

Ashish
New Contributor II

No, it's not answered yet.
