Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cost attribution based on table history statistics

noorbasha534
Valued Contributor II

Hello all,

I have a job that processes 50 tables: 25 belong to the finance domain, 20 to master data, and 5 to supply chain.

Now, imagine the job ran for 14 hours and cost me 1,000 euros on a given day. If I want to attribute that per-day cost to the data domains, which of the statistics below, found in table history, could be useful? (A sketch of reading them via DESCRIBE HISTORY follows the list.) I was thinking numOutputRows could be best, since executionTimeMs can include wait times etc., or is there a better way of doing this?

  • numTargetRowsCopied
  • numTargetRowsDeleted
  • numTargetFilesAdded
  • numTargetBytesAdded
  • numTargetBytesRemoved
  • numTargetDeletionVectorsAdded
  • numTargetRowsMatchedUpdated
  • executionTimeMs
  • materializeSourceTimeMs
  • numTargetRowsInserted
  • numTargetRowsMatchedDeleted
  • numTargetDeletionVectorsUpdated
  • scanTimeMs
  • numTargetRowsUpdated
  • numOutputRows
  • numTargetDeletionVectorsRemoved
  • numTargetRowsNotMatchedBySourceUpdated
  • numTargetChangeFilesAdded
  • numSourceRows
  • numTargetFilesRemoved
  • numTargetRowsNotMatchedBySourceDeleted
  • rewriteTimeMs
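
For reference, these statistics come from the operationMetrics column returned by DESCRIBE HISTORY on each Delta table. A minimal sketch of reading them for one table in a Databricks notebook (the table name and date are placeholders, and only MERGE operations are considered):

from pyspark.sql import functions as F

hist = spark.sql("DESCRIBE HISTORY finance.orders")  # hypothetical table name

(hist
 .filter(F.col("operation") == "MERGE")
 .filter(F.to_date("timestamp") == "2024-01-15")  # the day being attributed
 .select(
     "version",
     "timestamp",
     F.col("operationMetrics")["numTargetBytesAdded"].cast("long").alias("bytes_added"),
     F.col("operationMetrics")["numOutputRows"].cast("long").alias("output_rows"),
     F.col("operationMetrics")["executionTimeMs"].cast("long").alias("execution_ms"),
 )
 .show())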
3 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @noorbasha534 ,

To be honest, I don't think numOutputRows is a good candidate. Imagine a complex aggregation with multiple jobs and some filtering on a huge dataset: it could return a relatively small number of rows, yet cost far more than a job that simply materializes some tables.

You can apply tags to jobs, which gives you a way to attribute cost to a specific team or project fairly accurately.

For some inspiration you can check following blogs: 

https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/...

https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies

https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7e...
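
As a rough illustration of the tag-based approach, here is a minimal sketch against the system.billing.usage system table (the tag key "team" and the date are placeholders; the DBUs still need to be multiplied by list prices, e.g. from system.billing.list_prices, to get euros):

# Daily DBU consumption of job runs, grouped by a custom tag.
daily_dbus_by_tag = spark.sql("""
    SELECT
      custom_tags['team']  AS team,
      SUM(usage_quantity)  AS dbus
    FROM system.billing.usage
    WHERE usage_date = DATE '2024-01-15'
      AND usage_metadata.job_id IS NOT NULL
    GROUP BY custom_tags['team']
""")
daily_dbus_by_tag.show()
# To convert DBUs to currency, also group by sku_name and join system.billing.list_prices.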

noorbasha534
Valued Contributor II

@szymon_dybczak thanks for the reply. I have a single job that processes tables of multiple domains; I cannot split it, or the costs will blow up. The granular cost monitoring link you shared talks about cost attribution for SQL warehouses. We already looked at the math applied there: compilation time + execution time + xxxxx is used to attribute costs to users.

In our case, it is a job that runs on job compute. I hope I clarified the situation.

ManojkMohan
Contributor III

Root Cause / Why executionTimeMs isn't ideal

executionTimeMs includes everything the job did:

  • Waiting for resources
  • Shuffle, GC, or network latency
  • Contention with other concurrent jobs

Using it to allocate spend can misattribute costs, especially if some tables were idle or blocked while others were actively processing.

So executionTimeMs is noisy for cost attribution; it doesn't reflect the actual data volume processed or the work done.

Solution thinking: 

Calculate the cost per unit of the chosen metric (here, per byte written):

cost_per_byte = total_job_cost / sum(numTargetBytesAdded across all tables)


Attribute the per-domain cost:

cost_per_domain = cost_per_byte * sum(numTargetBytesAdded for tables in the domain)
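
Putting this together, a minimal sketch for a Databricks notebook (the table-to-domain mapping, the daily cost figure, and the date are placeholders; the metrics are read from each table's Delta history):

from pyspark.sql import functions as F

total_job_cost = 1000.0        # euros for the day, taken from billing
run_date = "2024-01-15"

domains = {                    # hypothetical mapping of tables to data domains
    "finance":      ["finance.orders", "finance.invoices"],
    "master_data":  ["mdm.customers"],
    "supply_chain": ["scm.shipments"],
}

def bytes_added_on(table, day):
    # Sum numTargetBytesAdded over the table's MERGE operations for one day.
    hist = spark.sql(f"DESCRIBE HISTORY {table}")
    row = (hist
           .filter(F.to_date("timestamp") == day)
           .filter(F.col("operation") == "MERGE")
           .agg(F.sum(F.col("operationMetrics")["numTargetBytesAdded"].cast("long")))
           .first())
    return row[0] or 0

domain_bytes = {d: sum(bytes_added_on(t, run_date) for t in tables)
                for d, tables in domains.items()}
total_bytes = sum(domain_bytes.values())

for domain, b in domain_bytes.items():
    share = b / total_bytes if total_bytes else 0.0
    print(domain, round(share * total_job_cost, 2), "EUR")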


Optional refinement:

  • If row sizes vary widely across tables, numTargetBytesAdded is more accurate.
  • If row sizes are uniform, numOutputRows is simpler.
  • You could also combine metrics (weighted by output bytes plus output rows) for a hybrid approach.

 

 
