I know there is already a Databricks (technically Spark) integration for DataDog. Unfortunately, that integration only covers cluster execution itself, which means Cluster Metrics and Spark Jobs and Tasks. I'm looking for something that will let me track metrics about the Databricks Jobs themselves (e.g., successful job runs, failed tasks, etc.).
Currently, it seems my only option would be to combine webhook notifications on the job with custom code inside the task. Unfortunately, this approach has the following drawbacks:
- We use PySpark, and if a run dies with a Kernel Unresponsive error, the custom code never gets a chance to report its metrics.
- The webhook notifications don't produce events that are directly usable as custom metrics within DataDog, so I would have to build some sort of custom handler to map/enrich the webhook events (see the sketch after this list).
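
For context, here is roughly what I imagine that handler would have to look like: a minimal sketch, assuming a small Flask service that receives the job webhook and forwards counts to DataDog via the `datadog` Python client. The payload field names (`event_type`, `job.job_id`, `run.run_id`) and the metric names are my assumptions, not an existing integration, and would need to match whatever the webhook actually sends.

```python
import os
import time

from datadog import api, initialize
from flask import Flask, request

# DataDog client setup; assumes DD_API_KEY / DD_APP_KEY are exported.
initialize(
    api_key=os.environ["DD_API_KEY"],
    app_key=os.environ["DD_APP_KEY"],
)

app = Flask(__name__)

# Assumed mapping of Databricks job event types to custom metric names.
EVENT_METRICS = {
    "jobs.on_start": "databricks.jobs.run.start",
    "jobs.on_success": "databricks.jobs.run.success",
    "jobs.on_failure": "databricks.jobs.run.failure",
}


@app.route("/databricks-webhook", methods=["POST"])
def databricks_webhook():
    payload = request.get_json(force=True)
    metric = EVENT_METRICS.get(payload.get("event_type"))
    if metric is None:
        # Ignore event types we aren't tracking.
        return "", 204

    # Enrich the metric with job/run identifiers as tags
    # (field names here are assumptions about the payload shape).
    tags = [
        f"job_id:{payload.get('job', {}).get('job_id')}",
        f"run_id:{payload.get('run', {}).get('run_id')}",
    ]
    # Submit a count of 1 for this run event as a custom metric.
    api.Metric.send(metric=metric, points=[(time.time(), 1)], tags=tags)
    return "", 202


if __name__ == "__main__":
    app.run(port=8080)
```

That's not a lot of code, but it's another service to deploy, secure, and keep running, which is exactly the kind of home-brew plumbing I'd like to avoid if something off-the-shelf exists.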
I feel like this isn't a unique challenge, and I'm surely not the only one looking for something like this, so before I go down the rabbit hole of building something home-brewed to solve this problem, I thought I would check with the community/Databricks support to see whether I'm missing something.