<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cost attribution based on table history statistics in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130885#M48933</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;To be honest, I don't think numOutputRows is a good candidate. Imagine a complex aggregation with multiple jobs and some filtering on a huge dataset: it could return a relatively small number of rows, yet cost far more than a job that simply materializes some tables.&lt;/P&gt;&lt;P&gt;You can apply tags to jobs, which gives you a way to attribute cost to a specific team/project fairly accurately.&lt;/P&gt;&lt;P&gt;For some inspiration, you can check the following blogs:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558" target="_blank"&gt;https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies" target="_blank"&gt;https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5" target="_blank"&gt;https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 04 Sep 2025 17:46:45 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2025-09-04T17:46:45Z</dc:date>
    <item>
      <title>Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130875#M48932</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I have a job that processes 50 tables: 25 belong to finance, 20 to master data, and 5 to supply chain data domains.&lt;/P&gt;&lt;P&gt;Now, imagine the job ran for 14 hours and cost me 1000 euros in a day. If I'd like to attribute that per-day cost to the data domains, which of the statistics below, found in table history, could be useful? (I was thinking numOutputRows could be best, as executionTime could involve wait times etc., or is there a better way of doing this?)&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;numTargetRowsCopied&lt;/LI&gt;&lt;LI&gt;numTargetRowsDeleted&lt;/LI&gt;&lt;LI&gt;numTargetFilesAdded&lt;/LI&gt;&lt;LI&gt;numTargetBytesAdded&lt;/LI&gt;&lt;LI&gt;numTargetBytesRemoved&lt;/LI&gt;&lt;LI&gt;numTargetDeletionVectorsAdded&lt;/LI&gt;&lt;LI&gt;numTargetRowsMatchedUpdated&lt;/LI&gt;&lt;LI&gt;executionTimeMs&lt;/LI&gt;&lt;LI&gt;materializeSourceTimeMs&lt;/LI&gt;&lt;LI&gt;numTargetRowsInserted&lt;/LI&gt;&lt;LI&gt;numTargetRowsMatchedDeleted&lt;/LI&gt;&lt;LI&gt;numTargetDeletionVectorsUpdated&lt;/LI&gt;&lt;LI&gt;scanTimeMs&lt;/LI&gt;&lt;LI&gt;numTargetRowsUpdated&lt;/LI&gt;&lt;LI&gt;numOutputRows&lt;/LI&gt;&lt;LI&gt;numTargetDeletionVectorsRemoved&lt;/LI&gt;&lt;LI&gt;numTargetRowsNotMatchedBySourceUpdated&lt;/LI&gt;&lt;LI&gt;numTargetChangeFilesAdded&lt;/LI&gt;&lt;LI&gt;numSourceRows&lt;/LI&gt;&lt;LI&gt;numTargetFilesRemoved&lt;/LI&gt;&lt;LI&gt;numTargetRowsNotMatchedBySourceDeleted&lt;/LI&gt;&lt;LI&gt;rewriteTimeMs&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 04 Sep 2025 17:09:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130875#M48932</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-04T17:09:26Z</dc:date>
    </item>
    <item>
      <title>Re: Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130885#M48933</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/124839"&gt;@noorbasha534&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;To be honest, I don't think numOutputRows is a good candidate. Imagine a complex aggregation with multiple jobs and some filtering on a huge dataset: it could return a relatively small number of rows, yet cost far more than a job that simply materializes some tables.&lt;/P&gt;&lt;P&gt;You can apply tags to jobs, which gives you a way to attribute cost to a specific team/project fairly accurately.&lt;/P&gt;&lt;P&gt;For some inspiration, you can check the following blogs:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558" target="_blank"&gt;https://community.databricks.com/t5/technical-blog/queries-for-cost-attribution-using-system-tables/ba-p/76558&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies" target="_blank"&gt;https://www.databricks.com/blog/attribute-serverless-costs-departments-and-users-budget-policies&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5" target="_blank"&gt;https://medium.com/dbsql-sme-engineering/introducing-granular-cost-monitoring-for-databricks-sql-e7ea4e77daf5&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 17:46:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130885#M48933</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-09-04T17:46:45Z</dc:date>
    </item>
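    <item>
      <title>Editor's note: sketch of the tag-based approach</title>
      <description>The tag-based attribution suggested above can be sketched as a query against the Databricks billing system table. This is a minimal sketch, not the author's exact method: the tag key "team" and the 30-day window are assumptions to adapt to your own job tags; the columns used (usage_date, usage_quantity, custom_tags) follow the documented system.billing.usage schema.

```python
# Minimal sketch: aggregate DBU usage by a custom job tag.
# Assumption: jobs are tagged with a "team" key; replace it with
# whatever tag key you actually apply to your jobs.
COST_BY_TAG_QUERY = """
SELECT
  custom_tags['team']  AS team,
  SUM(usage_quantity)  AS total_dbus
FROM system.billing.usage
WHERE usage_date &gt;= date_sub(current_date(), 30)
GROUP BY custom_tags['team']
ORDER BY total_dbus DESC
"""

if __name__ == "__main__":
    # Run this string via spark.sql(...) in a Databricks notebook,
    # then join total_dbus against list prices to get currency amounts.
    print(COST_BY_TAG_QUERY)
```

Note that DBUs, not euros, come out of this query; multiplying by the SKU price (also available in the billing system tables) converts usage to cost.</description>
    </item>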
    <item>
      <title>Re: Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130887#M48934</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;thanks for the reply. I have a single job that processes tables of multiple domains; I cannot split it, or the costs will blow up. The granular cost monitoring link you shared talks about cost attribution for SQL warehouses. We already looked at the math applied there: compilation time + execution time + xxxxx is considered to attribute costs to users.&lt;/P&gt;&lt;P&gt;In our case, it is a job that runs on job compute. I hope I clarified the situation.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Sep 2025 18:39:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130887#M48934</guid>
      <dc:creator>noorbasha534</dc:creator>
      <dc:date>2025-09-04T18:39:38Z</dc:date>
    </item>
    <item>
      <title>Re: Cost attribution based on table history statistics</title>
      <link>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130898#M48938</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Root cause&lt;/STRONG&gt; / why executionTimeMs isn't ideal&lt;/P&gt;&lt;P&gt;executionTimeMs includes everything the job did:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;waiting for resources&lt;/LI&gt;&lt;LI&gt;shuffle, GC, or network latency&lt;/LI&gt;&lt;LI&gt;contention with other concurrent jobs&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Using it to allocate costs can misattribute them, especially if some tables were idle or blocked while others were actively processing. So executionTime is noisy for cost attribution: it doesn't reflect the actual data volume processed or work done.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Solution thinking:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Calculate cost per unit of the chosen metric:&lt;/P&gt;&lt;P&gt;cost_per_byte = total_job_cost / sum(numTargetBytesAdded for all tables)&lt;/P&gt;&lt;P&gt;Then attribute per-domain cost:&lt;/P&gt;&lt;P&gt;cost_per_domain = sum(cost_per_byte * numTargetBytesAdded_for_tables_in_domain)&lt;/P&gt;&lt;P&gt;Optional refinement:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;If row sizes vary widely across tables, numTargetBytesAdded is more accurate.&lt;/LI&gt;&lt;LI&gt;If row sizes are uniform, numOutputRows is simpler.&lt;/LI&gt;&lt;LI&gt;You could also combine metrics (weighted by output bytes + output rows) for a hybrid approach.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 04 Sep 2025 20:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cost-attribution-based-on-table-history-statistics/m-p/130898#M48938</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-04T20:26:54Z</dc:date>
    </item>
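    <item>
      <title>Editor's note: sketch of the proportional allocation</title>
      <description>The allocation formula above can be sketched in a few lines of Python. This is an illustrative sketch under the assumptions stated in the thread: numTargetBytesAdded per table has already been collected from DESCRIBE HISTORY, and a table-to-domain mapping exists; the function and variable names below are hypothetical.

```python
# Illustrative sketch: split a job's total cost across data domains
# in proportion to bytes written (numTargetBytesAdded) per table.

def attribute_cost(total_job_cost, bytes_added_by_table, domain_by_table):
    """Return {domain: cost} with costs proportional to bytes written."""
    total_bytes = sum(bytes_added_by_table.values())
    cost_per_byte = total_job_cost / total_bytes
    costs = {}
    for table, nbytes in bytes_added_by_table.items():
        domain = domain_by_table[table]
        costs[domain] = costs.get(domain, 0.0) + cost_per_byte * nbytes
    return costs

if __name__ == "__main__":
    # Hypothetical inputs mirroring the 1000-euro example in the question.
    bytes_added = {"fin_tx": 600, "mdm_cust": 300, "sc_orders": 100}
    domains = {"fin_tx": "finance", "mdm_cust": "master_data",
               "sc_orders": "supply_chain"}
    print(attribute_cost(1000.0, bytes_added, domains))
    # → {'finance': 600.0, 'master_data': 300.0, 'supply_chain': 100.0}
```

Swapping numTargetBytesAdded for numOutputRows, or a weighted blend of both, only changes the values passed in; the allocation logic stays the same.</description>
    </item>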
  </channel>
</rss>

