Administration & Architecture

Monitoring job metrics

Bagger
New Contributor II

Hi,

We need to monitor Databricks jobs, and we have a setup where we are able to get the Prometheus metrics. However, we are lacking an overview of which metrics refer to what.

Namely, we need to monitor the following:

  • failed jobs: has a job failed
  • table ingest rate: how much data is ingested
  • table ingest lag: is a streaming job further behind than expected
  • table size: size of the table currently being ingested into
  • query runtime: how long a query has been running

Does anyone have any ideas on how to get those metrics (either through Prometheus or an alternative method)?

Accepted Solutions

Kaniz
Community Manager

Hi @Bagger, you can monitor Databricks jobs and get the required metrics using a combination of Databricks features and Prometheus. Here's a general idea of how you could approach each metric you mentioned.

1. Failed jobs: Databricks provides a Jobs REST API that lets you list jobs and their runs, together with each run's state. You can poll it to detect whether a job run has failed.

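2. Table ingest rate: This can be monitored through Delta Lake. Every commit to a Delta table is recorded in the table history (DESCRIBE HISTORY), and its operation metrics include the number of rows and bytes written, so you can derive an ingest rate from successive commits.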

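3. Table ingest lag: Depending on your specific use case, this can be a bit more complex. However, you could consider using Structured Streaming's built-in progress reporting (for example, a query's lastProgress or a StreamingQueryListener), which includes metrics such as input rate, processing rate, and batch duration. A processing rate that stays below the input rate is a sign that the stream is falling behind.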

4. Table size: This can also be monitored through Delta Lake. Running DESCRIBE DETAIL on a Delta table returns, among other things, its size in bytes and its number of files.

5. Query runtime: Databricks SQL (formerly SQL Analytics) lets you monitor and analyze your SQL workloads. Its query history shows how long each query ran, so you can use it to track query runtimes.
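As a rough illustration of point 1, here is a minimal sketch that polls the Jobs API `runs/list` endpoint and picks out failed runs. The function names, the `host` and `token` parameters, and the choice of which result states count as failures are assumptions made for this example; check the response shape against the REST API docs for your workspace before relying on it.

```python
import json
import urllib.request

# Result states treated as failures in this sketch (an assumption; adjust to your policy).
FAILURE_STATES = {"FAILED", "TIMEDOUT"}


def failed_runs(payload):
    """Return (job_id, run_id, result_state) for each failed run in a runs/list response."""
    failures = []
    for run in payload.get("runs", []):
        result = run.get("state", {}).get("result_state")
        if result in FAILURE_STATES:
            failures.append((run.get("job_id"), run.get("run_id"), result))
    return failures


def fetch_completed_runs(host, token, limit=25):
    """Fetch recently completed runs; host and token are placeholders you supply."""
    url = f"{host}/api/2.1/jobs/runs/list?completed_only=true&limit={limit}"
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hand-written sample in the shape of a runs/list response, so no network call is needed.
    sample = {
        "runs": [
            {"job_id": 1, "run_id": 10,
             "state": {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"}},
            {"job_id": 2, "run_id": 11,
             "state": {"life_cycle_state": "TERMINATED", "result_state": "FAILED"}},
        ]
    }
    print(failed_runs(sample))
```

Keeping the parsing in `failed_runs` separate from the HTTP call makes it easy to test the failure logic offline and to feed the result into whatever alerting you already have.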

As for Prometheus, you can continue using it for monitoring: the interfaces (e.g., PromQL, alerting configs) are still in use, even though the backend of the monitoring system has been migrated to M3.
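To connect the streaming metrics to Prometheus, one option is to render the fields of a Structured Streaming progress report in the Prometheus text exposition format and serve or push them from your own code. The sketch below assumes the progress report is available as a Python dict (as `query.lastProgress` is in PySpark); the function name and the metric names are hypothetical and not something Databricks emits by default.

```python
def progress_to_prometheus(progress):
    """Render selected fields of a streaming progress dict as Prometheus text-format gauges."""
    query_name = progress.get("name") or "unnamed"
    labels = f'query="{query_name}"'
    # Metric names are made up for this example; pick names that fit your conventions.
    fields = {
        "databricks_stream_input_rows_per_second": progress.get("inputRowsPerSecond"),
        "databricks_stream_processed_rows_per_second": progress.get("processedRowsPerSecond"),
        "databricks_stream_batch_input_rows": progress.get("numInputRows"),
    }
    lines = []
    for metric, value in fields.items():
        if value is None:
            continue  # skip fields missing from this progress report
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric}{{{labels}}} {float(value)}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Hand-written sample with the same keys a progress report uses.
    sample = {
        "name": "orders",
        "inputRowsPerSecond": 1200.5,
        "processedRowsPerSecond": 1500.0,
        "numInputRows": 2500,
    }
    print(progress_to_prometheus(sample), end="")
```

Comparing the input-rate gauge with the processed-rate gauge in PromQL then gives a simple signal for ingest lag: if the input rate stays above the processing rate, the stream is falling behind.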

Sources:
- [Databricks REST API](https://docs.databricks.com/dev-tools/api/latest/index.html)
- [Databricks Delta](https://docs.databricks.com/delta/index.html)
- [Databricks Structured Streaming](https://docs.databricks.com/spark/latest/structured-streaming/index.html)
- [Databricks SQL Analytics](https://docs.databricks.com/sql/index.html)
