Deriving a relation between spark job and underlying code
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-19-2024 08:51 PM
For one of our requirement, we need to derive a relation between spark job, stage ,task id with the underlying code executed after a workflow job is getting triggered using a job cluster. So far we are able to develop a relation between the Workflow job id and spark job,task,stage id.
Please suggest how should we proceed to derive a tabular relation between the spark job id along with the underlying code in the notebook.
Subhrajyoti Chatterjee
Technology Specialist at Azure Data Engineering Community,
Guild & Community (GnC), Analytics & AI
Cognizant Technology Solutions
Kolkata, India
- Labels:
-
Workflows
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-09-2024 09:40 AM
Hi @Subhrajyoti thanks for your question!
I'm not sure if you have tried this already, but by combining listener logs with structured tabular data, you can create a clear mapping between Spark job executions and the corresponding notebook code. You could leverage Spark’s JobListener interface to capture job, stage, and task information programmatically. Implement a custom listener to log job start and end events along with their properties (e.g., job ID, associated stages, and triggering commands). This lets you capture metadata about jobs as they run, enabling correlation with your notebook code.
Then, you can log relationships between Spark jobs and the triggering notebook code into a structured storage system like a Delta table and use Python logging or spark.sparkContext.setJobDescription to associate a job with a meaningful description (e.g., the code snippet or notebook cell), e.g.:
spark.sparkContext.setJobDescription("ETL Step 1: Load Data")
...And finally write metadata to the table during execution for post-analysis, e.g:
import datetime
spark.sql(f"""
INSERT INTO spark_job_metadata VALUES (
'{job_id}', '{stage_id}', '{task_id}', 'example_notebook', 'ETL Step 1', '{datetime.datetime.now()}'
)
""")