<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cost attribution based on table history statistics in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130885#M48933</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;To be honest, I don't think numOutputRows is a good candidate. Imagine a complex aggregation with multiple jobs and some filtering on a huge dataset: it could return a relatively small number of rows, yet cost far more than a job that simply materializes some tables.&lt;/P&gt;&lt;P&gt;You can apply tags to jobs, which gives you a way to attribute cost to a specific team/project fairly accurately.&lt;/P&gt;&lt;P&gt;For some inspiration, you can check the following blogs:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558" target="_blank"&gt;https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies" target="_blank"&gt;https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5" target="_blank"&gt;https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 04 Sep 2025 17:46:45 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2025-09-04T17:46:45Z</dc:date>
    <item>
      <title>Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130875#M48932</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I have a job that processes 50 tables: 25 belong to finance, 20 to master data, and 5 to supply chain data domains.&lt;/P&gt;&lt;P&gt;Now, imagine the job ran for 14 hours and cost me 1000 euros in a day. If I'd like to attribute that per-day cost to the data domains, which of the statistics below, found in table history, could be useful? (I was thinking numOutputRows could be best, as executionTime could involve wait times etc., or is there a better way of doing this?)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;numTargetRowsCopied&lt;/LI&gt;&lt;LI&gt;numTargetRowsDeleted&lt;/LI&gt;&lt;LI&gt;numTargetFilesAdded&lt;/LI&gt;&lt;LI&gt;numTargetBytesAdded&lt;/LI&gt;&lt;LI&gt;numTargetBytesRemoved&lt;/LI&gt;&lt;LI&gt;numTargetDeletionVectorsAdded&lt;/LI&gt;&lt;LI&gt;numTargetRowsMatchedUpdated&lt;/LI&gt;&lt;LI&gt;executionTimeMs&lt;/LI&gt;&lt;LI&gt;materializeSourceTimeMs&lt;/LI&gt;&lt;LI&gt;numTargetRowsInserted&lt;/LI&gt;&lt;LI&gt;numTargetRowsMatchedDeleted&lt;/LI&gt;&lt;LI&gt;numTargetDeletionVectorsUpdated&lt;/LI&gt;&lt;LI&gt;scanTimeMs&lt;/LI&gt;&lt;LI&gt;numTargetRowsUpdated&lt;/LI&gt;&lt;LI&gt;numOutputRows&lt;/LI&gt;&lt;LI&gt;numTargetDeletionVectorsRemoved&lt;/LI&gt;&lt;LI&gt;numTargetRowsNotMatchedBySourceUpdated&lt;/LI&gt;&lt;LI&gt;numTargetChangeFilesAdded&lt;/LI&gt;&lt;LI&gt;numSourceRows&lt;/LI&gt;&lt;LI&gt;numTargetFilesRemoved&lt;/LI&gt;&lt;LI&gt;numTargetRowsNotMatchedBySourceDeleted&lt;/LI&gt;&lt;LI&gt;rewriteTimeMs&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 04 Sep 2025 17:09:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130875#M48932</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-04T17:09:26Z</dc:date>
    </item>
    <item>
      <title>Re: Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130885#M48933</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;To be honest, I don't think numOutputRows is a good candidate. Imagine a complex aggregation with multiple jobs and some filtering on a huge dataset: it could return a relatively small number of rows, yet cost far more than a job that simply materializes some tables.&lt;/P&gt;&lt;P&gt;You can apply tags to jobs, which gives you a way to attribute cost to a specific team/project fairly accurately.&lt;/P&gt;&lt;P&gt;For some inspiration, you can check the following blogs:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558" target="_blank"&gt;https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies" target="_blank"&gt;https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5" target="_blank"&gt;https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 17:46:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130885#M48933</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-04T17:46:45Z</dc:date>
    </item>
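    <item>
      <title>Editor's note: sketch of the tag-based approach</title>
      <description>The tag-based attribution suggested above can be sketched as a query against the Databricks billing system table. This is a minimal sketch, not the author's exact method: the tag key "team" and the 30-day window are assumptions to adapt to your own job tags; the columns used (usage_date, usage_quantity, custom_tags) follow the documented system.billing.usage schema.

```python
# Minimal sketch: aggregate DBU usage by a custom job tag.
# Assumption: jobs are tagged with a "team" key; replace it with
# whatever tag key you actually apply to your jobs.
COST_BY_TAG_QUERY = """
SELECT
  custom_tags['team']  AS team,
  SUM(usage_quantity)  AS total_dbus
FROM system.billing.usage
WHERE usage_date &gt;= date_sub(current_date(), 30)
GROUP BY custom_tags['team']
ORDER BY total_dbus DESC
"""

if __name__ == "__main__":
    # Run this string via spark.sql(...) in a Databricks notebook,
    # then join total_dbus against list prices to get currency amounts.
    print(COST_BY_TAG_QUERY)
```

Note that DBUs, not euros, come out of this query; multiplying by the SKU price (also available in the billing system tables) converts usage to cost.</description>
    </item>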
    <item>
      <title>Re: Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130887#M48934</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;thanks for the reply. I have a single job that processes tables of multiple domains; I cannot split it, or the costs will blow up. The granular cost monitoring link you shared talks about cost attribution for SQL warehouses. We already looked at the math applied there: compilation time + execution time + xxxxx is considered to attribute costs to users.&lt;/P&gt;&lt;P&gt;In our case, it is a job that runs on job compute. I hope I clarified the situation.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 18:39:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130887#M48934</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-04T18:39:38Z</dc:date>
    </item>
    <item>
      <title>Re: Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130898#M48938</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Root cause&lt;/STRONG&gt; / why executionTimeMs isn't ideal&lt;/P&gt;&lt;P&gt;executionTimeMs includes everything the job did:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;waiting for resources&lt;/LI&gt;&lt;LI&gt;shuffle, GC, or network latency&lt;/LI&gt;&lt;LI&gt;contention with other concurrent jobs&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Using it to allocate costs can misattribute them, especially if some tables were idle or blocked while others were actively processing. So executionTime is noisy for cost attribution: it doesn't reflect the actual data volume processed or work done.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Solution thinking:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Calculate cost per unit of the chosen metric:&lt;/P&gt;&lt;P&gt;cost_per_byte = total_job_cost / sum(numTargetBytesAdded for all tables)&lt;/P&gt;&lt;P&gt;Then attribute per-domain cost:&lt;/P&gt;&lt;P&gt;cost_per_domain = sum(cost_per_byte * numTargetBytesAdded_for_tables_in_domain)&lt;/P&gt;&lt;P&gt;Optional refinement:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;If row sizes vary widely across tables, numTargetBytesAdded is more accurate.&lt;/LI&gt;&lt;LI&gt;If row sizes are uniform, numOutputRows is simpler.&lt;/LI&gt;&lt;LI&gt;You could also combine metrics (weighted by output bytes + output rows) for a hybrid approach.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 04 Sep 2025 20:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130898#M48938</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-04T20:26:54Z</dc:date>
    </item>
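    <item>
      <title>Editor's note: sketch of the proportional allocation</title>
      <description>The allocation formula above can be sketched in a few lines of Python. This is an illustrative sketch under the assumptions stated in the thread: numTargetBytesAdded per table has already been collected from DESCRIBE HISTORY, and a table-to-domain mapping exists; the function and variable names below are hypothetical.

```python
# Illustrative sketch: split a job's total cost across data domains
# in proportion to bytes written (numTargetBytesAdded) per table.

def attribute_cost(total_job_cost, bytes_added_by_table, domain_by_table):
    """Return {domain: cost} with costs proportional to bytes written."""
    total_bytes = sum(bytes_added_by_table.values())
    cost_per_byte = total_job_cost / total_bytes
    costs = {}
    for table, nbytes in bytes_added_by_table.items():
        domain = domain_by_table[table]
        costs[domain] = costs.get(domain, 0.0) + cost_per_byte * nbytes
    return costs

if __name__ == "__main__":
    # Hypothetical inputs mirroring the 1000-euro example in the question.
    bytes_added = {"fin_tx": 600, "mdm_cust": 300, "sc_orders": 100}
    domains = {"fin_tx": "finance", "mdm_cust": "master_data",
               "sc_orders": "supply_chain"}
    print(attribute_cost(1000.0, bytes_added, domains))
    # → {'finance': 600.0, 'master_data': 300.0, 'supply_chain': 100.0}
```

Swapping numTargetBytesAdded for numOutputRows, or a weighted blend of both, only changes the values passed in; the allocation logic stays the same.</description>
    </item>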
  </channel>
</rss>

