
Naming jobs in the Spark UI in Databricks Runtime 15.4

sunnyday
New Contributor

I am asking almost the same question as: https://community.databricks.com/t5/data-engineering/how-to-improve-spark-ui-job-description-for-pys...

Because I am running Databricks Runtime 15.4, I receive the following message when accessing the sparkContext:

[JVM_ATTRIBUTE_NOT_SUPPORTED] Directly accessing the underlying Spark driver JVM using the attribute 'sparkContext' is not supported on shared clusters. If you require direct access to these fields, consider using a single-user cluster. For more details on compatibility and limitations, check: https://docs.databricks.com/compute/access-mode-limitations.html#shared-access-mode-limitations-on-u...
 
Accordingly, I do not think that I can use setJobDescription or setName as outlined in that answer. Could you please give an example of naming jobs and tasks, including the Python class that is called?
Could you also confirm what the effect of using df.alias("example_df") in the Spark UI should be?
I would consider this question answered with these examples:
- Set the JobGroup name in the Spark UI, either from the driver or the worker.
- Set the Job description in the Spark UI, either from the driver or the worker.
- Set the Stage description in the Spark UI, either from the driver or the worker.
- Set the descriptions on the visual blocks in the Dag visualization pages.
 
The example code, before annotations are added, could look like the following:

 

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

def stream_parquet_to_delta(s3_path, delta_table_path):
    # Initialize Spark session
    spark = SparkSession.builder.appName("StreamToDeltaExample").getOrCreate()

    # Streaming Parquet sources require an explicit schema;
    # placeholder columns shown here — replace with the real schema
    schema = StructType([
        StructField("id", LongType()),
        StructField("payload", StringType()),
    ])

    # Read streaming data from S3 in Parquet format
    streaming_df = spark.readStream.format("parquet").schema(schema).load(s3_path)

    # Define the function to process each micro-batch
    def process_batch(df, batch_id):
        # Write the micro-batch to the Delta table
        df.write.format("delta").mode("append").save(delta_table_path)

    # Write streaming data to the Delta table using foreachBatch
    # (streaming writes need a checkpoint location; this path is a placeholder)
    streaming_df.writeStream \
        .option("checkpointLocation", f"{delta_table_path}_checkpoint") \
        .foreachBatch(process_batch) \
        .start() \
        .awaitTermination()

# Example usage
stream_parquet_to_delta("s3://my-bucket/streaming-data", "/delta-table/streaming_data")

 

1 REPLY

mark_ott
Databricks Employee

You are correct: on Databricks Runtime 15.4 with shared-access-mode (Unity Catalog-enabled) clusters, you will see the [JVM_ATTRIBUTE_NOT_SUPPORTED] error when trying to access sparkContext attributes directly, since those are only available on single-user clusters. This means sc.setJobGroup(), sc.setJobDescription(), and spark.sparkContext.setLocalProperty() are unavailable in this mode. Below are your requested clarifications and alternatives.
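
For illustration, a minimal sketch of the failure (assuming spark is the ambient session in a Databricks notebook):

try:
    spark.sparkContext.setJobDescription("my descriptive name")
except Exception as e:
    # Prints the [JVM_ATTRIBUTE_NOT_SUPPORTED] message quoted in the question
    print(e)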


Naming Jobs, Tasks, and Stages in Databricks 15.4 (Shared Cluster / Unity Catalog)

Job and Task Naming: What Works and What Doesn't

Method                                              Availability (Shared Cluster / Unity Catalog)   How to Use / Alternative
sc.setJobGroup()                                    Not supported                                   Not possible
sc.setJobDescription()                              Not supported                                   Not possible
sc.setLocalProperty('spark.job.description', ...)   Not supported                                   Not possible
DataFrame.alias("example_df")                       Supported                                       See effect below
DataFrame.writeStream.queryName()                   Supported                                       See example below
 
 

How to Add Identifiable Names in Spark UI

Rather than the methods that directly use sparkContext, use these supported options:

1. Naming Structured Streaming Queries in Spark UI

Use .queryName() to set a descriptive name for your streaming query. The name appears in the Spark UI's Structured Streaming tab under active streaming queries.

Example:

streaming_df.writeStream \
    .queryName("ParquetToDelta_Stream") \
    .foreachBatch(process_batch) \
    .start() \
    .awaitTermination()

This names the query “ParquetToDelta_Stream” in the Structured Streaming tab of the Spark UI.
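
You can also confirm the name programmatically; a small sketch (start() returns a StreamingQuery handle):

query = streaming_df.writeStream \
    .queryName("ParquetToDelta_Stream") \
    .foreachBatch(process_batch) \
    .start()
print(query.name)  # -> ParquetToDelta_Stream
query.awaitTermination()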


2. Effect of df.alias("example_df") in Spark UI

  • The .alias("example_df") method only sets a logical alias for the DataFrame used in SQL expressions and does not affect job, stage, or DAG block naming in the Spark UI. It is most helpful for SQL readability and debugging optimizer plans, not for UI description.
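
A minimal sketch of where an alias does surface (the join condition and the analyzed logical plan, not the UI; the names here are hypothetical):

from pyspark.sql.functions import col

left_df = spark.range(5).alias("l")
right_df = spark.range(5).alias("r")
joined = left_df.join(right_df, col("l.id") == col("r.id"))
# The aliases appear as SubqueryAlias nodes in the analyzed plan,
# not as job or stage names in the Spark UI
joined.explain(True)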


3. Block Descriptions in the DAG

  • There is currently no direct API for user-defined block descriptions in the DAG on Databricks when using shared clusters. Block/stage names are inferred from the operations (e.g., "Project", "Aggregate", "Filter").

  • For more explicit names in the UI, break your code into small, well-named functions—thereby making your code easier to correlate with the Spark UI, though the block labels themselves are not user-customizable in shared clusters.
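
As a sketch of that pattern (function, column, and DataFrame names here are hypothetical):

from pyspark.sql.functions import col

def filter_recent_events(df):
    # Narrows the data; surfaces as a Filter node in the DAG
    return df.where(col("event_date") >= "2024-01-01")

def count_by_user(df):
    # Aggregation; surfaces as HashAggregate/Exchange stages
    return df.groupBy("user_id").count()

result = count_by_user(filter_recent_events(events_df))  # events_df: any input DataFrame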


Best Practice Example for Databricks 15.4 (Shared Cluster Mode)

from pyspark.sql import SparkSession

def stream_parquet_to_delta(s3_path, delta_table_path):
    spark = SparkSession.builder.appName("StreamToDeltaExample").getOrCreate()
    streaming_df = spark.readStream.format("parquet").load(s3_path)

    def process_batch(df, batch_id):
        # logic here...
        df.write.format("delta").mode("append").save(delta_table_path)

    # Set a meaningful name for the streaming query in Spark UI
    streaming_df.writeStream \
        .queryName("ParquetToDelta_Stream") \
        .foreachBatch(process_batch) \
        .start() \
        .awaitTermination()

stream_parquet_to_delta("s3://my-bucket/streaming-data", "/delta-table/streaming_data")
  • This query will have the name “ParquetToDelta_Stream” in the Streaming tab of the Spark UI.

  • The code structure helps relate code blocks to the DAG plan, but block descriptions are not directly modifiable in the UI on shared clusters.


Summary Table: What Is and Isn't Possible

Want to Set...             Supported? (Shared Cluster)   How / Alternative
Job group name             No                            Not supported
Job description            No                            Not supported
Stage description          No                            Not supported
Query name (streaming)     Yes                           Use .queryName() on writeStream
Table/view name            Yes                           Create a temporary view with createOrReplaceTempView()
DataFrame alias            Yes                           Use .alias(); only affects SQL plans
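
For the temp-view route, a minimal sketch (the path and view name are assumed from the example above):

df = spark.read.format("delta").load("/delta-table/streaming_data")
df.createOrReplaceTempView("streaming_data_snapshot")
# The view name appears in the SQL text shown on the SQL / DataFrame tab of the Spark UI
spark.sql("SELECT COUNT(*) FROM streaming_data_snapshot").show()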