01-10-2022 12:46 AM
Hi All,
I am working on a requirement where I need to calculate the cost of each Spark job individually on a shared Azure/AWS Databricks cluster. Multiple jobs can run on the cluster in parallel.
Cost needs to be calculated after job completion, and it has to be calculated programmatically.
So, I'm looking for an API:
- which can fetch the cost of each individual job directly (that would be the ideal solution),
or
- which can report a job's resource usage in terms of vcore-seconds on a Databricks cluster, the way YARN exposes it through the ResourceManager API.
In a YARN cluster, I'm using the following approach to calculate the cost of each Spark application individually:
Cost = (EC2 instance cost per core-hour * application's vcore-hours) + EMR fee
In YARN, the ResourceManager provides the resource usage of each Spark application in terms of memory-seconds and vcore-seconds.
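For clarity, this is roughly the calculation I run on the EMR side once YARN reports the vcore-seconds (the prices below are placeholder values, not real AWS rates):

```python
# Rough EMR-side cost model (placeholder prices, not real AWS rates).
def spark_app_cost(vcore_seconds: float,
                   ec2_price_per_core_hour: float,
                   emr_fee: float) -> float:
    """Cost = (EC2 cost per core-hour * vcore-hours) + EMR fee."""
    vcore_hours = vcore_seconds / 3600.0
    return ec2_price_per_core_hour * vcore_hours + emr_fee

# Example: 720,000 vcore-seconds (= 200 vcore-hours) at hypothetical prices.
print(spark_app_cost(720_000, ec2_price_per_core_hour=0.05, emr_fee=2.00))
```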
01-10-2022 02:14 AM
There is built-in functionality for getting the costs:
The main problem with that functionality is that the smallest granularity you get is the cluster/job, because it relies on the cluster/job tags. There are also problems with compute costs when nodes come from instance pools, as the nodes aren't re-tagged when they are used.
These problems can be solved by using the Overwatch project, which lets you get more granular data, like costs per notebook/user/...
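If each job runs on its own job cluster, one way to make the tag-based reporting usable is to put a job identifier into the cluster's custom tags, so the cloud bill can be grouped by it. A minimal sketch of a new_cluster spec for the Jobs API (the tag keys and values here are just placeholders):

```python
# Sketch of a job cluster spec whose custom tags propagate to the underlying
# cloud resources, so billing can be grouped per job (tag names are placeholders).
new_cluster = {
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "custom_tags": {
        "CostCenter": "data-engineering",      # hypothetical tag
        "JobName": "daily-sales-aggregation",  # hypothetical tag
    },
}
```

As noted above, though, this breaks down when the nodes come from an instance pool, since those nodes aren't re-tagged when they are used.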
01-10-2022 01:17 AM
Do you have to calculate this beforehand? Because that is pretty hard to predict.
Especially if you use autoscaling.
Besides hardware provisioning cost, you also have the DBUs.
And it depends on the type of VM used, whether it is a job cluster or an interactive cluster, etc.
On Azure an excellent cost breakdown is possible; I suppose this is also possible on AWS.
But that is of course post hoc, so your jobs have already run.
01-11-2022 08:10 PM
Cost needs to be calculated after job completion, and it has to be calculated programmatically (which I forgot to mention earlier).
And I'm calculating the cost as follows:
Cost = (EC2 instance cost per core-hour * job's vcore-hours) + EMR fee
Therefore, I need an API that can give the job's resource usage in terms of vcore-seconds/hours on a Databricks cluster, the way YARN provides it through the ResourceManager API.
Is there any API available? Or is there any other way to calculate the cost?
And if there is an API that can give the cost of each job directly, that would be great.
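For illustration, this is roughly how I could piece it together from the Jobs Runs API and Clusters API if nothing more direct exists, assuming the job runs on a dedicated job cluster so the whole cluster's core-hours belong to that run (the host, token, and per-node core counts are my placeholders, and autoscaling would make this only an approximation):

```python
import requests

HOST = "https://<workspace>.cloud.databricks.com"              # placeholder
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

# Hypothetical vCPU counts per node type; fill in for your instance types.
CORES_PER_NODE = {"i3.xlarge": 4, "i3.2xlarge": 8}

def run_vcore_hours(run_id: int) -> float:
    """Approximate vcore-hours of a job run on a dedicated job cluster."""
    run = requests.get(f"{HOST}/api/2.1/jobs/runs/get",
                       headers=HEADERS, params={"run_id": run_id}).json()
    hours = (run["end_time"] - run["start_time"]) / 3_600_000.0  # ms -> hours

    cluster_id = run["cluster_instance"]["cluster_id"]
    cluster = requests.get(f"{HOST}/api/2.0/clusters/get",
                           headers=HEADERS,
                           params={"cluster_id": cluster_id}).json()

    cores = CORES_PER_NODE[cluster["node_type_id"]]
    nodes = cluster.get("num_workers", 0) + 1  # workers + driver
    return cores * nodes * hours
```

The result would then plug into the same cost formula above, with the DBU cost added on top.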
01-15-2022 01:56 AM
No, it's not answered yet.