Hi everyone,
I’m working on a way to capture Spark runtime metrics so we can monitor and analyze our workloads more effectively.
From my research, I understand that this is typically done with SparkListener, a JVM-side API (org.apache.spark.scheduler.SparkListener / SparkListenerInterface) that you implement in Scala or Java. However, since all our jobs are written in PySpark, I’m looking for ways to implement similar functionality purely in Python, or at least to integrate a listener cleanly into our PySpark workflows.
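To make the question concrete, below is the kind of pure-Python approach I’ve been sketching: since PySpark talks to the JVM through py4j, a Python class can (as far as I understand) implement org.apache.spark.scheduler.SparkListenerInterface via py4j callbacks and be registered on the underlying SparkContext. Treat this as an untested sketch rather than working code — the set of onXxx callbacks varies by Spark version (hence the catch-all __getattr__), and the callback-server helper may differ between PySpark releases:

```python
from pyspark.sql import SparkSession
from pyspark.java_gateway import ensure_callback_server_started


class MetricsListener:
    """Sketch of a pure-Python listener registered through py4j callbacks."""

    def onJobEnd(self, jobEnd):
        # jobEnd is a py4j proxy of SparkListenerJobEnd
        print(f"job {jobEnd.jobId()} finished: {jobEnd.jobResult().toString()}")

    def onTaskEnd(self, taskEnd):
        metrics = taskEnd.taskMetrics()
        if metrics is not None:
            print(f"task run time: {metrics.executorRunTime()} ms, "
                  f"shuffle read: {metrics.shuffleReadMetrics().totalBytesRead()} bytes")

    def __getattr__(self, name):
        # SparkListenerInterface defines many other onXxx callbacks that differ
        # between Spark versions; answer anything not handled above with a no-op.
        return lambda *args, **kwargs: None

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]


spark = SparkSession.builder.appName("pyspark-listener-sketch").getOrCreate()
sc = spark.sparkContext

# The JVM needs py4j's callback server running in order to call back into Python;
# this helper exists in recent PySpark releases (streaming listeners use it too).
ensure_callback_server_started(sc._gateway)

sc._jsc.sc().addSparkListener(MetricsListener())

# Trigger a job so the callbacks fire.
spark.range(1_000_000).selectExpr("sum(id)").show()
```

I’m not sure how robust this is in practice (listener exceptions, version-specific callbacks, overhead of the py4j round trips), which is part of why I’m asking.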
I’m aware that libraries like pyspark-spy offer methods such as persisting_spark() to capture Spark metrics from within PySpark, but they don’t cover all the metrics I need. Has anyone tried writing a custom Scala SparkListener to capture detailed runtime metrics, packaging it as a JAR, and attaching it to the Spark cluster? I’m interested in that approach, but I’ve found it difficult to implement the Scala listener and to integrate it with PySpark through the JVM gateway.
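On the Scala route, my current understanding is that the listener doesn’t have to be wired up through the py4j gateway at all: Spark can instantiate it from the JAR via the spark.extraListeners setting, and the gateway is only needed if I want to pull collected values back into Python. Something along these lines, where com.example.MetricsListener and snapshotJson() are placeholders for whatever the Scala side would actually expose:

```python
from pyspark.sql import SparkSession

# Spark instantiates classes named in spark.extraListeners at startup; the class
# needs a no-arg constructor (or one that takes a SparkConf). Depending on how
# the job is launched, the JAR may need to go on the driver classpath instead
# (e.g. via --jars on spark-submit or spark.driver.extraClassPath).
spark = (
    SparkSession.builder
    .appName("scala-listener-demo")
    .config("spark.jars", "/path/to/metrics-listener.jar")          # hypothetical JAR
    .config("spark.extraListeners", "com.example.MetricsListener")  # hypothetical class
    .getOrCreate()
)

# Run some work so the listener has events to observe.
spark.range(10_000_000).selectExpr("count(*)").show()

# If the Scala side exposes an accessor (say, a companion-object method returning
# the collected metrics as JSON), it can be reached through the JVM gateway.
# snapshotJson() is a placeholder, not a real Spark API.
print(spark._jvm.com.example.MetricsListener.snapshotJson())
```

Is configuring it this way the recommended pattern, or is it better to register the listener instance explicitly through sc._jsc.sc().addSparkListener(...)?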
Are there recommended patterns or tools that simplify this process without having to maintain Scala code? And if anyone has working examples of SparkListener-like behavior done purely in PySpark (more robust than my sketch above), or of hybrid approaches, that would be incredibly helpful.
Thanks in advance for your insights!