Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Accessing Spark Runtime Metrics Using PySpark – Seeking Best Practices

saicharandeepb
New Contributor II

Hi everyone,

I’m currently working on a solution to access Spark runtime metrics for better monitoring and analysis of our workloads.

From my research, I understand that this can be implemented using SparkListener, which is a JVM interface available in Scala/Java. However, since all our jobs are written in PySpark, I’m looking for ways to implement similar functionality purely in Python, or at least to integrate it cleanly with our PySpark workflows.

I’m aware that libraries like pyspark-spy offer methods such as persisting_spark() to capture Spark metrics natively within PySpark, but they don’t cover all the metrics I need. Has anyone tried writing a custom Scala SparkListener to capture detailed runtime metrics, packaging it as a JAR, and attaching it to the Spark cluster? I’m interested in this approach, but I’ve found it difficult to implement the listener and integrate it with PySpark through the JVM gateway.
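
For context, the rough shape of what I’ve been attempting from the PySpark side is below. This is only a sketch: com.example.MetricsListener is a placeholder for our own Scala listener class, which would first need to be compiled into a JAR and installed on the cluster so the driver can see it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Instantiate the Scala listener through the Py4J gateway
    # (placeholder class name; assumes a no-arg constructor and that the JAR
    # is already on the driver classpath)
    listener = sc._jvm.com.example.MetricsListener()

    # Register it with the underlying JVM SparkContext
    sc._jsc.sc().addSparkListener(listener)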

Are there recommended patterns or tools that simplify this process without needing to maintain Scala code? Additionally, if anyone has examples of writing SparkListener-like behavior purely in PySpark or hybrid approaches, that would be incredibly helpful.

Thanks in advance for your insights!

 

2 REPLIES

Brahmareddy
Esteemed Contributor

Hi saicharandeepb,

How are you doing today? As I understand it, since SparkListener is native to Scala/Java, getting detailed runtime metrics from PySpark can be tricky, but there are some workarounds.

If you need deep metrics (stage-level and executor-level data), the most reliable way is to write a custom SparkListener in Scala, package it as a JAR, and attach it to your Databricks cluster; many teams do this by uploading the JAR to DBFS and referencing it in the cluster config. The listener can log metrics to a Delta table or an external location, which you can then read from PySpark.

There’s no full Python-native listener, but libraries like sparkmeasure or pyspark-spy can help collect basic SQL and job-level metrics, though they’re limited.

If you prefer not to maintain Scala code, consider using Databricks’ built-in tools, such as Ganglia metrics, audit logs, or the REST API, to pull run-level metadata after job completion.

Each approach has trade-offs, but combining them can give decent visibility without diving deep into Scala. Let me know if you’d like a sample setup for any of these!
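
For example, a minimal sparkmeasure setup from a PySpark notebook (where spark is already defined) looks roughly like this. Treat it as a sketch rather than a drop-in solution: it assumes the spark-measure JAR and the sparkmeasure Python package are installed on the cluster, and the table name at the end is just an example.

    from sparkmeasure import StageMetrics

    # Collect stage-level metrics around a block of work
    stagemetrics = StageMetrics(spark)
    stagemetrics.begin()

    spark.range(0, 10_000_000).selectExpr("sum(id)").show()

    stagemetrics.end()

    # Aggregate report: executor run time, CPU time, shuffle and GC metrics, etc.
    stagemetrics.print_report()

    # Optionally persist the collected metrics so they can be queried later
    # (example table name; adjust to your workspace)
    metrics_df = stagemetrics.create_stagemetrics_DF()
    metrics_df.write.mode("append").saveAsTable("monitoring.stage_metrics")

Writing the metrics out to a table like this is also an easy way to get the “log to a Delta table and read it back from PySpark” pattern above without maintaining any Scala.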

Hi @Brahmareddy ,

I’m doing well, thanks for asking! Hope you’re doing great too.

I’m particularly interested in deep runtime metrics (stage-level, executor-level, and task breakdowns). I actually tried attaching a custom JAR to the cluster for a SparkListener setup, but I couldn’t get it to work.
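
For reference, what I had tried was installing the JAR as a cluster library and then pointing the cluster’s Spark config at the listener class, roughly like this (com.example.MetricsListener is a placeholder for our class name):

    # Cluster Spark config (Advanced options > Spark), set after the JAR is installed as a library
    spark.extraListeners com.example.MetricsListener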

We’re also keen on getting these metrics in near real time rather than only after job completion, since that would help us with monitoring and faster troubleshooting. It would be really helpful if you could guide me through the setup or share a sample configuration that works in Databricks.

Thanks in advance!