Hi,
We need to monitor Databricks jobs, and we have a setup where we are able to get the Prometheus metrics. However, we are lacking an overview of which metrics refer to what.
Namely, we need to monitor the following:
- failed jobs: whether a job has failed
- table ingest rate: how much data is ingested per unit of time
- table ingest lag: whether a streaming job is further behind than expected
- table size: size of the table currently being ingested into
- query runtime: how long a query has been running
Does anyone have any ideas on how to get those metrics (either through Prometheus or an alternative method)?
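
To make the first and last items concrete, here is a rough sketch of the checks we have in mind. It assumes run records shaped like the entries returned by the Databricks Jobs API `runs/list` endpoint (`run_id`, `start_time` in epoch milliseconds, and a `state` object with `life_cycle_state` and `result_state`); those field names should be checked against the API version in use. The helper names and the sample data are made up for illustration.

```python
from typing import Any, Optional

Run = dict[str, Any]  # one run record; field names are assumed, see note above


def failed_runs(runs: list[Run]) -> list[int]:
    """Return the run_ids of runs whose result_state is FAILED."""
    return [
        run["run_id"]
        for run in runs
        if run.get("state", {}).get("result_state") == "FAILED"
    ]


def running_seconds(run: Run, now_ms: int) -> Optional[float]:
    """Return elapsed seconds for a run that is still RUNNING, else None."""
    if run.get("state", {}).get("life_cycle_state") != "RUNNING":
        return None
    return (now_ms - run["start_time"]) / 1000.0


# Hypothetical sample data, not real API output.
sample_runs: list[Run] = [
    {"run_id": 1, "start_time": 1_700_000_000_000,
     "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
    {"run_id": 2, "start_time": 1_700_000_000_000,
     "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
    {"run_id": 3, "start_time": 1_700_000_000_000,
     "state": {"life_cycle_state": "RUNNING"}},
]

print(failed_runs(sample_runs))
print(running_seconds(sample_runs[2], now_ms=1_700_000_060_000))
```

Values like these could then be exposed as gauges, but we do not know whether the existing Prometheus metrics already cover them, which is the core of the question.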