I am asking almost the same question as: https://community.databricks.com/t5/data-engineering/how-to-improve-spark-ui-job-description-for-pys...
Because I am running Databricks 15.4, I receive the following message when accessing the sparkContext:
[JVM_ATTRIBUTE_NOT_SUPPORTED] Directly accessing the underlying Spark driver JVM using the attribute 'sparkContext' is not supported on shared clusters. If you require direct access to these fields, consider using a single-user cluster. For more details on compatibility and limitations, check: https://docs.databricks.com/compute/access-mode-limitations.html#shared-access-mode-limitations-on-u...
Accordingly, I do not think I can use setJobDescription and setName as outlined in that answer. Could you please give an example of naming jobs and tasks, including the Python class that is called? Could you also confirm what effect using df.alias("example_df") should have in the Spark UI?
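For reference, this is the kind of call that triggers the error for me (a minimal sketch; the description text is only a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# On a 15.4 shared cluster, even touching spark.sparkContext raises
# [JVM_ATTRIBUTE_NOT_SUPPORTED], so the call below never runs.
spark.sparkContext.setJobDescription("ingest daily parquet batch")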
I would consider this question answered with these examples:
- Set the JobGroup name in the Spark UI, either from the driver or the worker.
- Set the Job description in the Spark UI, either from the driver or the worker.
- Set the Stage description in the Spark UI, either from the driver or the worker.
- Set the descriptions on the visual blocks in the DAG visualization pages.
The example code, before annotations are added, could look like the following:
from pyspark.sql import SparkSession

def stream_parquet_to_delta(s3_path, delta_table_path):
    # Initialize Spark session
    spark = SparkSession.builder.appName("StreamToDeltaExample").getOrCreate()

    # Read streaming data from S3 in Parquet format
    streaming_df = spark.readStream.format("parquet").load(s3_path)

    # Define the function to process each micro-batch
    def process_batch(df, batch_id):
        # Write the micro-batch to the Delta table
        df.write.format("delta").mode("append").save(delta_table_path)

    # Write streaming data to the Delta table using foreachBatch
    streaming_df.writeStream.foreachBatch(process_batch).start().awaitTermination()

# Example usage
stream_parquet_to_delta("s3://my-bucket/streaming-data", "/delta-table/streaming_data")
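On the df.alias("example_df") part of the question, this is roughly what I mean (a minimal sketch; aliased_df is just a name I made up):

# Inside stream_parquet_to_delta, for example:
aliased_df = streaming_df.alias("example_df")
# I am unsure whether "example_df" is expected to appear anywhere in the
# Spark UI, or whether the alias only affects column resolution in joins.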