Hi @Bagger, you can monitor Databricks jobs and collect the metrics you listed by combining built-in Databricks features with Prometheus. Here's a general idea of how you could approach each one, with a short sketch for each after the list.
1. Failed jobs: Databricks provides a REST API (the Jobs API) that lets you list runs for a job along with their result state, so you can detect failed runs (see the first sketch after this list).
2. Table ingest rate: This can be monitored through Delta Lake. Every commit to a Delta table is recorded in the table history together with operation metrics such as rows written, which you can turn into an ingest rate (second sketch below).
3. Table ingest lag: Depending on your specific use case this can be a bit more complex. If you ingest with Structured Streaming, though, you can use its built-in progress reporting, which includes metrics such as processing rate and end-to-end batch latency (third sketch below).
4. Table size: Also covered by Delta Lake; `DESCRIBE DETAIL` on a Delta table reports its current size in bytes (fourth sketch below).
5. Query runtime: Databricks SQL lets you monitor and analyze your SQL workloads, and its Query History API reports execution times per query (fifth sketch below).
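For failed jobs, here's a minimal sketch against the Jobs API 2.1 `runs/list` endpoint; the workspace URL, token, and job ID are placeholders you'd replace with your own:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

def failed_runs(job_id: int, limit: int = 25) -> list:
    """Return recent runs of a job whose result state is FAILED."""
    resp = requests.get(
        f"{HOST}/api/2.1/jobs/runs/list",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"job_id": job_id, "limit": limit},
    )
    resp.raise_for_status()
    runs = resp.json().get("runs", [])
    # A run has failed when its terminal result_state is FAILED.
    return [
        r for r in runs
        if r.get("state", {}).get("result_state") == "FAILED"
    ]
```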
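For the ingest rate, a sketch that reads the Delta table history from a notebook (where `spark` is already defined); the table name is hypothetical:

```python
from delta.tables import DeltaTable

TABLE = "main.default.events"  # hypothetical table name

# Each commit is one row in the history; write/merge commits carry
# operationMetrics such as numOutputRows. Comparing rows written
# across commit timestamps approximates an ingest rate.
for row in DeltaTable.forName(spark, TABLE).history(20).collect():
    metrics = row["operationMetrics"] or {}
    if "numOutputRows" in metrics:
        print(row["timestamp"], row["operation"], metrics["numOutputRows"])
```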
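For ingest lag with Structured Streaming, `lastProgress` on the query handle describes the most recent micro-batch; `query` below is assumed to be the handle returned by `writeStream.start()`:

```python
# `query` is the StreamingQuery returned by writeStream.start().
progress = query.lastProgress
if progress:
    # durationMs breaks down where batch time was spent;
    # triggerExecution is roughly the end-to-end batch latency.
    print("batch:", progress["batchId"])
    print("rows/sec:", progress["processedRowsPerSecond"])
    print("latency ms:", progress["durationMs"]["triggerExecution"])
```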
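For table size, `DESCRIBE DETAIL` returns a single row that includes `sizeInBytes` and `numFiles` (again assuming a notebook where `spark` exists):

```python
TABLE = "main.default.events"  # hypothetical table name

# DESCRIBE DETAIL yields one row of table-level metadata for a Delta table.
detail = spark.sql(f"DESCRIBE DETAIL {TABLE}").collect()[0]
print("size_bytes:", detail["sizeInBytes"], "files:", detail["numFiles"])
```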
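For query runtime, a sketch against the Databricks SQL Query History API; the field names follow the 2.0 API as I recall them, and the host/token are placeholders:

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

resp = requests.get(
    f"{HOST}/api/2.0/sql/history/queries",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"max_results": 50},
)
resp.raise_for_status()
for q in resp.json().get("res", []):
    # duration is reported in milliseconds.
    print(q.get("query_text", "")[:60], q.get("duration"), "ms")
```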
As for Prometheus: you can keep using it for monitoring, since the interfaces you work with (e.g., PromQL, alerting configs) stay the same even though the backend of the monitoring system has been migrated to M3. One way to wire things together is to push the metrics gathered above to a Pushgateway that Prometheus scrapes (sketch below).
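This sketch uses the `prometheus_client` package to push one of the metrics above; the gateway address and the metric value are assumptions:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
table_size = Gauge(
    "delta_table_size_bytes",
    "Current size of a Delta table in bytes",
    ["table"],
    registry=registry,
)
# In practice, call set() with the sizeInBytes value read above.
table_size.labels(table="main.default.events").set(123456789)

# Gateway address is an assumption; point it at your own Pushgateway.
push_to_gateway("pushgateway:9091", job="databricks_metrics", registry=registry)
```

Since only the write path changes, your existing PromQL queries and alert rules should keep working as-is.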
Sources:
- [Databricks REST API](https://docs.databricks.com/dev-tools/api/latest/index.html)
- [Databricks Delta](https://docs.databricks.com/delta/index.html)
- [Databricks Structured Streaming](https://docs.databricks.com/spark/latest/structured-streaming/index.html)
- [Databricks SQL Analytics](https://docs.databricks.com/sql/index.html)