<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Data Volume Read/Processed for a Databricks Workflow Job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106042#M42363</link>
    <description></description>
    <pubDate>Fri, 17 Jan 2025 09:03:55 GMT</pubDate>
    <dc:creator>saurabh18cs</dc:creator>
    <dc:date>2025-01-17T09:03:55Z</dc:date>
    <item>
      <title>Data Volume Read/Processed for a Databricks Workflow Job</title>
      <link>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106031#M42354</link>
      <description>&lt;P&gt;Hello all, I have a Databricks instance hosted on Azure, and I am using Diagnostic Settings to collect Databricks Jobs logs in a Log Analytics workspace. So far, from the DatabricksJobs table in Azure Log Analytics, I can fetch basic job data such as status and duration. I would also like to gather insights about the total volume of data read/processed, or the throughput of a job run, similar to what the ADFActivityRun table provides for ADF pipeline runs (Copy activities).&lt;BR /&gt;&lt;BR /&gt;I need help understanding the following:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Is it reasonable to expect this kind of data for a Databricks job? If yes, how can I fetch it?&lt;/LI&gt;&lt;LI&gt;If not, how can I gain such data read/processed insights for Databricks entities? Is this applicable only to Delta tables, or something like that?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Also, please note: for the time being I am running the jobs on a multi-node cluster that is not Unity Catalog enabled. If any solution has specific cluster requirements, please let me know.&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jan 2025 06:25:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106031#M42354</guid>
      <dc:creator>sahasimran98</dc:creator>
      <dc:date>2025-01-17T06:25:14Z</dc:date>
    </item>
    <item>
      <title>Re: Data Volume Read/Processed for a Databricks Workflow Job</title>
      <link>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106042#M42363</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/144142"&gt;@sahasimran98&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;You can choose one of the following approaches:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1. Enable Spark Metrics&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics. The sink class named in the configuration below has to be available on the cluster.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;"spark.metrics.conf.*.sink.azureloganalytics.class": "org.apache.spark.metrics.sink.AzureLogAnalyticsSink",
"spark.metrics.conf.*.sink.azureloganalytics.workspaceId": "&amp;lt;your-log-analytics-workspace-id&amp;gt;",
"spark.metrics.conf.*.sink.azureloganalytics.primaryKey": "&amp;lt;your-log-analytics-primary-key&amp;gt;",
"spark.metrics.conf.*.sink.azureloganalytics.period": "10"&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;2. Use Spark Listener&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Implement a custom Spark listener to capture detailed metrics about data read/processed during job execution. Spark loads listeners named in spark.extraListeners as JVM classes, so the listener is written in Scala (or Java), packaged as a JAR, and attached to the cluster.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;package com.example

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class CustomSparkListener extends SparkListener {
  // Called once per finished task; taskMetrics can be null for failed tasks
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"Task ${taskEnd.taskInfo.taskId} read ${metrics.inputMetrics.bytesRead} bytes")
    }
  }
}&lt;/PRE&gt;&lt;P&gt;Then register the listener in the cluster's Spark config:&lt;/P&gt;&lt;PRE&gt;spark.extraListeners com.example.CustomSparkListener&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;3. Use Databricks REST API&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Use the Databricks REST API to fetch details for job runs, such as state and timings.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;import requests

databricks_instance = "https://&amp;lt;databricks-instance&amp;gt;"
token = "&amp;lt;your-databricks-token&amp;gt;"
job_id = "&amp;lt;your-job-id&amp;gt;"

headers = {"Authorization": f"Bearer {token}"}

# runs/list accepts a job_id; runs/get expects a run_id instead
response = requests.get(
    f"{databricks_instance}/api/2.1/jobs/runs/list",
    headers=headers,
    params={"job_id": job_id},
)
response.raise_for_status()
job_run_details = response.json()

print(job_run_details)&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;4. Monitor Delta Tables&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;If you are using Delta tables, you can monitor Delta Lake transaction logs to gather insights about data read/processed. The table history exposes per-operation metrics in its operationMetrics column.&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "path/to/delta/table")
history = delta_table.history()
# operationMetrics holds per-operation counters, e.g. rows and bytes written
history.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)&lt;/PRE&gt;</description>
      <pubDate>Fri, 17 Jan 2025 09:03:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106042#M42363</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-01-17T09:03:55Z</dc:date>
    </item>
    <item>
      <title>Re: Data Volume Read/Processed for a Databricks Workflow Job</title>
      <link>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106262#M42430</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/22314"&gt;@saurabh18cs&lt;/a&gt;&amp;nbsp;, thank you for your response.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;Could you please share documentation on the first method you recommended: "&lt;STRONG&gt;Enable Spark Metrics:&amp;nbsp;&lt;/STRONG&gt;Databricks provides detailed metrics for Spark jobs, stages, and tasks. You can enable these metrics and send them to Azure Log Analytics."?&lt;BR /&gt;&lt;BR /&gt;I am a bit unsure about the usage of this, so some reference material would really help!&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 02:44:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106262#M42430</guid>
      <dc:creator>sahasimran98</dc:creator>
      <dc:date>2025-01-20T02:44:19Z</dc:date>
    </item>
    <item>
      <title>Re: Data Volume Read/Processed for a Databricks Workflow Job</title>
      <link>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106319#M42448</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/144142"&gt;@sahasimran98&lt;/a&gt;, I think you're right: this is more valid for Synapse, where such a configuration exists, but you can still give it a try on Databricks and let us know the results here. Otherwise, try to find a spark-monitoring package on GitHub for Databricks.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 13:23:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-volume-read-processed-for-a-databricks-workflow-job/m-p/106319#M42448</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-01-20T13:23:18Z</dc:date>
    </item>
  </channel>
</rss>

