Root Cause / Why executionTimeMs isn't ideal
executionTimeMs includes everything the job did:
Waiting for resources
Shuffle, GC, or network latency
Contention with other concurrent jobs
Allocating costs on this basis can misattribute them, especially if some tables sat idle or blocked while others were actively processing.
So executionTime is noisy for cost attribution; it doesn't reflect actual data volume processed or work done.
Solution thinking:
Calculate cost per unit of work (note the unit must match the metric, so bytes, not MB, if summing numTargetBytesAdded directly):
cost_per_byte = total_job_cost / sum(numTargetBytesAdded for all tables)
Attribute per-domain cost:
cost_per_domain = sum(cost_per_byte * numTargetBytesAdded for tables in that domain)
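The two steps above can be sketched as a small helper. The table names, domain mapping, and cost figure below are hypothetical; in practice numTargetBytesAdded would come from each table's Delta operation metrics:

```python
from collections import defaultdict

def attribute_cost(total_job_cost, table_metrics, table_to_domain):
    """Split a job's cost across domains in proportion to bytes written.

    table_metrics: {table_name: numTargetBytesAdded}
    table_to_domain: {table_name: domain_name}
    Returns {domain_name: attributed_cost}.
    """
    total_bytes = sum(table_metrics.values())
    if total_bytes == 0:
        return {}  # nothing written; no basis for attribution
    cost_per_byte = total_job_cost / total_bytes
    domain_cost = defaultdict(float)
    for table, bytes_added in table_metrics.items():
        domain_cost[table_to_domain[table]] += cost_per_byte * bytes_added
    return dict(domain_cost)

# Illustrative inputs: a $10 job writing to three tables in two domains.
metrics = {"orders": 600_000_000, "customers": 300_000_000, "audit": 100_000_000}
domains = {"orders": "sales", "customers": "sales", "audit": "compliance"}
print(attribute_cost(10.0, metrics, domains))
```

By construction the attributed amounts sum back to the total job cost, which is the main sanity check worth keeping when wiring this into a real pipeline.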
Optional refinement:
If row sizes vary widely across tables, numTargetBytesAdded is the more accurate basis.
If row sizes are roughly uniform, numOutputRows is simpler.
You could also combine both metrics (weighted by output bytes and output rows) for a hybrid approach.
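The hybrid approach can be sketched as a convex blend of a table's byte share and row share. The function name and the alpha weighting knob are assumptions for illustration, not from the source:

```python
def hybrid_share(bytes_added, rows_out, total_bytes, total_rows, alpha=0.7):
    """Blend a table's byte share and row share into one attribution weight.

    alpha controls the lean toward bytes (alpha=1.0 is bytes-only,
    alpha=0.0 is rows-only); the 0.7 default is an arbitrary assumption.
    """
    byte_share = bytes_added / total_bytes if total_bytes else 0.0
    row_share = rows_out / total_rows if total_rows else 0.0
    return alpha * byte_share + (1 - alpha) * row_share

# Example: a table that wrote 60% of the bytes but only 20% of the rows.
w = hybrid_share(600, 20, 1000, 100, alpha=0.5)
print(w)  # halfway between the byte share (0.6) and the row share (0.2)
```

Multiplying each table's hybrid share by total_job_cost then gives the blended attribution; since the shares sum to 1 across tables, the attributed costs still sum to the job total.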