<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Deriving a relation between spark job and underlying code in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/deriving-a-relation-between-spark-job-and-underlying-code/m-p/101509#M40701</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133068"&gt;@Subhrajyoti&lt;/a&gt;&amp;nbsp;thanks for your question!&lt;/P&gt;
&lt;P&gt;I'm not sure if you have tried this already, but by combining listener logs with structured tabular data, you can create a clear mapping between Spark job executions and the corresponding notebook code. You could leverage Spark’s SparkListener interface to capture job, stage, and task information programmatically: implement a custom listener that logs job start and end events along with their properties (e.g., job ID, associated stage IDs, and the triggering command). This lets you capture metadata about jobs as they run and correlate it with your notebook code.&lt;/P&gt;
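&lt;P&gt;The SparkListener interface itself lives on the JVM (Scala/Java). As a minimal Python-side sketch of the same correlation idea, PySpark's StatusTracker plus a named job group can map job IDs back to the code that set the group (assuming spark is the notebook's session):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch: tie Spark jobs to notebook code via a named job group
# (a Python-side alternative to implementing a JVM SparkListener).
sc = spark.sparkContext
sc.setJobGroup("etl_step_1", "ETL Step 1: Load Data")

spark.range(10_000_000).selectExpr("sum(id)").collect()  # any action triggers a job

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup("etl_step_1"):
    info = tracker.getJobInfo(job_id)
    if info is not None:
        print(job_id, info.status, list(info.stageIds))&lt;/LI-CODE&gt;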
&lt;P&gt;Then, log the relationship between each Spark job and the triggering notebook code to a structured storage system such as a Delta table, and use Python logging or spark.sparkContext.setJobDescription to tag a job with a meaningful description (e.g., the code snippet or notebook cell), e.g.:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;spark.sparkContext.setJobDescription("ETL Step 1: Load Data")&lt;/LI-CODE&gt;
&lt;P&gt;...And finally, write the metadata to the table during execution for post-analysis, e.g.:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import datetime

# job_id, stage_id, and task_id are assumed to come from your listener logs,
# and the spark_job_metadata Delta table is assumed to already exist with a
# matching schema (e.g., job_id, stage_id, task_id, notebook, step, event_time).
spark.sql(f"""
    INSERT INTO spark_job_metadata VALUES (
        '{job_id}', '{stage_id}', '{task_id}', 'example_notebook', 'ETL Step 1', '{datetime.datetime.now()}'
    )
""")&lt;/LI-CODE&gt;
</description>
    <pubDate>Mon, 09 Dec 2024 17:40:21 GMT</pubDate>
    <dc:creator>VZLA</dc:creator>
    <dc:date>2024-12-09T17:40:21Z</dc:date>
    <item>
      <title>Deriving a relation between spark job and underlying code</title>
      <link>https://community.databricks.com/t5/data-engineering/deriving-a-relation-between-spark-job-and-underlying-code/m-p/99447#M39994</link>
      <description>&lt;P&gt;For one of our requirements, we need to derive a relation between the Spark job, stage, and task IDs and the underlying code executed after a workflow job is triggered on a job cluster. So far, we have been able to derive a relation between the workflow job ID and the Spark job, task, and stage IDs.&lt;/P&gt;&lt;P&gt;Please suggest how we should proceed to derive a tabular relation between the Spark job ID and the underlying code in the notebook.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Nov 2024 04:51:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/deriving-a-relation-between-spark-job-and-underlying-code/m-p/99447#M39994</guid>
      <dc:creator>Subhrajyoti</dc:creator>
      <dc:date>2024-11-20T04:51:48Z</dc:date>
    </item>
    <item>
      <title>Re: Deriving a relation between spark job and underlying code</title>
      <link>https://community.databricks.com/t5/data-engineering/deriving-a-relation-between-spark-job-and-underlying-code/m-p/101509#M40701</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/133068"&gt;@Subhrajyoti&lt;/a&gt;&amp;nbsp;thanks for your question!&lt;/P&gt;
&lt;P&gt;I'm not sure if you have tried this already, but by combining listener logs with structured tabular data, you can create a clear mapping between Spark job executions and the corresponding notebook code. You could leverage Spark’s SparkListener interface to capture job, stage, and task information programmatically: implement a custom listener that logs job start and end events along with their properties (e.g., job ID, associated stage IDs, and the triggering command). This lets you capture metadata about jobs as they run and correlate it with your notebook code.&lt;/P&gt;
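&lt;P&gt;The SparkListener interface itself lives on the JVM (Scala/Java). As a minimal Python-side sketch of the same correlation idea, PySpark's StatusTracker plus a named job group can map job IDs back to the code that set the group (assuming spark is the notebook's session):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Sketch: tie Spark jobs to notebook code via a named job group
# (a Python-side alternative to implementing a JVM SparkListener).
sc = spark.sparkContext
sc.setJobGroup("etl_step_1", "ETL Step 1: Load Data")

spark.range(10_000_000).selectExpr("sum(id)").collect()  # any action triggers a job

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup("etl_step_1"):
    info = tracker.getJobInfo(job_id)
    if info is not None:
        print(job_id, info.status, list(info.stageIds))&lt;/LI-CODE&gt;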
&lt;P&gt;Then, log the relationship between each Spark job and the triggering notebook code to a structured storage system such as a Delta table, and use Python logging or spark.sparkContext.setJobDescription to tag a job with a meaningful description (e.g., the code snippet or notebook cell), e.g.:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;spark.sparkContext.setJobDescription("ETL Step 1: Load Data")&lt;/LI-CODE&gt;
&lt;P&gt;...And finally, write the metadata to the table during execution for post-analysis, e.g.:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import datetime

# job_id, stage_id, and task_id are assumed to come from your listener logs,
# and the spark_job_metadata Delta table is assumed to already exist with a
# matching schema (e.g., job_id, stage_id, task_id, notebook, step, event_time).
spark.sql(f"""
    INSERT INTO spark_job_metadata VALUES (
        '{job_id}', '{stage_id}', '{task_id}', 'example_notebook', 'ETL Step 1', '{datetime.datetime.now()}'
    )
""")&lt;/LI-CODE&gt;
</description>
      <pubDate>Mon, 09 Dec 2024 17:40:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/deriving-a-relation-between-spark-job-and-underlying-code/m-p/101509#M40701</guid>
      <dc:creator>VZLA</dc:creator>
      <dc:date>2024-12-09T17:40:21Z</dc:date>
    </item>
  </channel>
</rss>

