<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to get Spark run-time and structured metrics before job completion? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131979#M49308</link>
    <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I’m trying to collect Spark run-time metrics and Structured Streaming metrics. After enabling cluster logging, I see the following folders:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="saicharandeepb_0-1757940030183.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19978iC062B52F52F80DC0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="saicharandeepb_0-1757940030183.png" alt="saicharandeepb_0-1757940030183.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I noticed that the eventlog folder is only populated after a job has completed, which makes it difficult to calculate metrics in near real time.&lt;/P&gt;&lt;P&gt;Is there a common parser or recommended approach for reading the driver and executor logs so that I can compute these metrics while the job is still running, rather than only after completion?&lt;/P&gt;&lt;P&gt;Thanks in advance for your guidance!&lt;/P&gt;</description>
    <pubDate>Mon, 15 Sep 2025 12:41:01 GMT</pubDate>
    <dc:creator>saicharandeepb</dc:creator>
    <dc:date>2025-09-15T12:41:01Z</dc:date>
    <item>
      <title>How to get Spark run-time and structured metrics before job completion?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131979#M49308</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;I’m trying to collect Spark run-time metrics and Structured Streaming metrics. After enabling cluster logging, I see the following folders:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="saicharandeepb_0-1757940030183.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19978iC062B52F52F80DC0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="saicharandeepb_0-1757940030183.png" alt="saicharandeepb_0-1757940030183.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I noticed that the eventlog folder is only populated after a job has completed, which makes it difficult to calculate metrics in near real time.&lt;/P&gt;&lt;P&gt;Is there a common parser or recommended approach for reading the driver and executor logs so that I can compute these metrics while the job is still running, rather than only after completion?&lt;/P&gt;&lt;P&gt;Thanks in advance for your guidance!&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2025 12:41:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131979#M49308</guid>
      <dc:creator>saicharandeepb</dc:creator>
      <dc:date>2025-09-15T12:41:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to get Spark run-time and structured metrics before job completion?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131984#M49312</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/170061"&gt;@saicharandeepb&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I would recommend using this &lt;A href="https://gist.github.com/rayalex/c78b4ccbe560bb72d7f3395ea9ec9a6d" target="_self"&gt;Gist by rayalex&lt;/A&gt;.&lt;/P&gt;&lt;P class=""&gt;It integrates EC2 Alloy with Prometheus and Grafana, allowing you to capture and visualize Spark run-time and structured streaming metrics in near real time.&lt;/P&gt;&lt;P class=""&gt;It’s not a solution natively integrated into Databricks (since, as far as I know, runtime-level access is restricted), but I think it’s a very solid approach if your goal is to collect this information and display it in a dashboard.&lt;BR /&gt;&lt;BR /&gt;Hope this helps &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2025 13:34:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131984#M49312</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-09-15T13:34:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to get Spark run-time and structured metrics before job completion?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131990#M49315</link>
      <description>&lt;P&gt;I would recommend the following approaches:&lt;/P&gt;&lt;TABLE width="628"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="119"&gt;Method&lt;/TD&gt;&lt;TD width="125"&gt;Real-Time?&lt;/TD&gt;&lt;TD width="145"&gt;Complexity&lt;/TD&gt;&lt;TD width="239"&gt;Typical Use Case&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="119"&gt;SparkListener / StreamingQueryListener&lt;/TD&gt;&lt;TD width="125"&gt;Yes&lt;/TD&gt;&lt;TD width="145"&gt;Moderate&lt;/TD&gt;&lt;TD width="239"&gt;Live job/stage/batch metrics&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="119"&gt;Custom Metrics Source&lt;/TD&gt;&lt;TD width="125"&gt;Yes&lt;/TD&gt;&lt;TD width="145"&gt;More advanced&lt;/TD&gt;&lt;TD width="239"&gt;Fine-grained, app-specific metrics&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="119"&gt;Metrics Sinks&lt;/TD&gt;&lt;TD width="125"&gt;Yes&lt;/TD&gt;&lt;TD width="145"&gt;Easy to moderate&lt;/TD&gt;&lt;TD width="239"&gt;External dashboards/monitoring&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Example of a custom metrics source that can feed an external Prometheus sink:&lt;/P&gt;&lt;P&gt;// Placed in Spark's own package because the Source trait is private[spark].&lt;BR /&gt;package org.apache.spark.metrics.source&lt;/P&gt;&lt;P&gt;import com.codahale.metrics.{MetricRegistry, SettableGauge}&lt;BR /&gt;import org.apache.spark.SparkEnv&lt;BR /&gt;import org.apache.spark.sql.streaming.StreamingQueryListener&lt;/P&gt;&lt;P&gt;object MyCustomSource extends Source {&lt;BR /&gt;override def sourceName: String = "MyCustomSource"&lt;BR /&gt;override val metricRegistry: MetricRegistry = new MetricRegistry&lt;BR /&gt;// Settable gauge named "a"; it holds the most recent batch ID.&lt;BR /&gt;val MY_METRIC_A: SettableGauge[Long] = metricRegistry.gauge(MetricRegistry.name("a"))&lt;/P&gt;&lt;P&gt;// Updates the gauge on every streaming progress event.&lt;BR /&gt;class MyListener extends StreamingQueryListener {&lt;BR /&gt;// StreamingQueryListener declares these callbacks as abstract, so no-op overrides are required.&lt;BR /&gt;override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}&lt;BR /&gt;override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {&lt;BR /&gt;MyCustomSource.MY_METRIC_A.setValue(event.progress.batchId)&lt;BR /&gt;}&lt;BR /&gt;override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = {}&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;// Registers the source with the metrics system and returns a listener to attach.&lt;BR /&gt;def apply(): MyListener = {&lt;BR /&gt;SparkEnv.get.metricsSystem.registerSource(MyCustomSource)&lt;BR /&gt;new MyListener()&lt;BR /&gt;}&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;// Register in your Spark app:&lt;BR /&gt;spark.streams.addListener(MyCustomSource())&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This exposes a custom metric (here, batchId) to Spark’s metrics system, from which a configured metrics sink can feed Prometheus and Grafana.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Sep 2025 14:06:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/131990#M49315</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-15T14:06:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to get Spark run-time and structured metrics before job completion?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/132061#M49338</link>
      <description>&lt;P&gt;Did you try the solution above? Please keep us updated.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Sep 2025 05:33:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-get-spark-run-time-and-structured-metrics-before-job/m-p/132061#M49338</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-09-16T05:33:48Z</dc:date>
    </item>
  </channel>
</rss>

