What is Overwatch?
Overwatch is an observability tool that helps you monitor spend on your clouds and track usage across various dimensions. It works by collecting job and audit log data, joining it with data from the Databricks REST API and other sources available in the platform, and processing it into a set of tables that describe the ongoing activity of your Databricks Workspace(s).
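Once a deployment has run, those tables live in a consumer database that you can query like any other. Here is a minimal sketch; the database name "overwatch" is an assumption, so substitute the consumer database name you configured:

```scala
// Minimal sketch: list what Overwatch produced. The consumer database name
// ("overwatch") is an assumption -- use the name chosen in your deployment.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

spark.sql("SHOW TABLES IN overwatch").show(truncate = false)
```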
Overwatch is maintained as part of Databricks Labs and supports all the major clouds: Azure, AWS, and GCP. In this post we will look at a variety of analytics made possible by Overwatch, then discuss what a multi-Workspace deployment is and how to implement it!
Features of Overwatch
Monitoring Workspaces
1. Total spend
2. Cluster count by type
3. DBU cost vs compute cost
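Each of these Workspace-level views can be derived from Overwatch's consumer tables. The sketch below computes DBU cost vs. compute cost per Workspace; the table name (overwatch.clusterstatefact) and the columns (organization_id, total_dbu_cost, total_compute_cost) are assumptions based on the documented consumer layer, so verify them against your own deployment:

```scala
// Sketch only: table and column names are assumptions -- check your schema.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// DBU cost vs. compute cost per Workspace (organization_id identifies the Workspace).
spark.table("overwatch.clusterstatefact")
  .groupBy(col("organization_id"))
  .agg(
    round(sum(col("total_dbu_cost")), 2).as("dbu_cost"),
    round(sum(col("total_compute_cost")), 2).as("compute_cost")
  )
  .orderBy(desc("dbu_cost"))
  .show(truncate = false)
```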
Monitoring Clusters
1. Most expensive clusters by day
2. DBU spend by cluster type
3. Cluster node types
4. Percentage of auto-scaling clusters
5. Scale-up time of clusters without pools
6. Cluster failure states and count of failures
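As an example of these cluster-level metrics, the query below sketches "most expensive clusters by day". Again, the table (overwatch.clusterstatefact) and columns (state_start_date, cluster_name, total_cost) are assumptions to adapt to your schema:

```scala
// Sketch only: adjust table/column names to match your Overwatch deployment.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Most expensive clusters by day: total cost per cluster per calendar day.
spark.table("overwatch.clusterstatefact")
  .groupBy(to_date(col("state_start_date")).as("day"), col("cluster_name"))
  .agg(round(sum(col("total_cost")), 2).as("daily_cost"))
  .orderBy(desc("daily_cost"))
  .show(20, truncate = false)
```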
Monitoring Jobs
1. DBUs by Workflow by Workspace by date
2. Jobs running on interactive clusters
3. Daily job status distribution
4. Impact of failure by Workspace
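The job-level metrics come from the job-run cost tables. The sketch below approximates "DBUs by Workflow by Workspace by date"; the table (overwatch.jobruncostpotentialfact) and columns (organization_id, job_name, run_start_date, total_dbu_cost) are hypothetical names to be mapped onto your consumer schema:

```scala
// Sketch only: names below are assumptions; map them onto your consumer tables.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// DBU spend per Workflow, per Workspace, per day.
spark.table("overwatch.jobruncostpotentialfact")
  .groupBy(
    col("organization_id"),
    col("job_name"),
    to_date(col("run_start_date")).as("run_date")
  )
  .agg(round(sum(col("total_dbu_cost")), 2).as("dbu_cost"))
  .orderBy(desc("dbu_cost"))
  .show(20, truncate = false)
```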
Here are some other analyses you can perform with Overwatch:
- Last 30 days spend
Aggregated cluster spend across all Workspaces for the last 30 days (see the query sketch after this list).
- Month-over-month change in spend
Percentage change in cluster spend compared with the previous month. A value below zero indicates spend is down from the previous month, and a value above zero indicates it is up.
- Top 3 cluster spend by workspace in the last 30 days
The three clusters with the highest spend in each Workspace over the last 30 days.
- Week-over-week top 10 fastest growing clusters by Workspace
The ten clusters whose spend grew fastest compared with the previous week, per Workspace. A value below zero indicates spend shrank relative to the previous week, and a value above zero indicates it grew.
- Last 7 days of spend by Databricks Workflow
Expenses for each job in the last 7 days.
- Last 7 days of spend for Databricks Workflows executed on interactive clusters
Expenses for jobs executed on interactive clusters in the previous 7 days.
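As a concrete example of the first analysis above, the sketch below sums cluster spend across all Workspaces for the last 30 days (same caveat as before: the table and column names are assumptions):

```scala
// Sketch only: verify table/column names against your Overwatch consumer database.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()

// Aggregate cluster spend across all Workspaces for the last 30 days.
spark.table("overwatch.clusterstatefact")
  .filter(to_date(col("state_start_date")) >= date_sub(current_date(), 30))
  .agg(round(sum(col("total_cost")), 2).as("last_30_days_spend"))
  .show()
```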
What is a multi-Workspace deployment of Overwatch?
If you have multiple Databricks Workspaces and want to monitor them collectively, you can implement a multi-Workspace deployment. If the prospect of monitoring jobs across each of your 100 Workspaces seems daunting, this is the solution: a single job in one Workspace aggregates data from all specified Workspaces and incorporates it into a centralized database in your Lakehouse. You can then query data for any Workspace from one place, streamlining the monitoring process.
Architecture of a multi-Workspace deployment
Overwatch can be deployed on a single, primary Workspace and then retrieve data from all other Databricks Workspaces. For more details on the requirements, see Multi-Workspace Consideration. In many cases, some Workspaces should be able to monitor many Workspaces while others should only monitor themselves; where the output data is co-located and who should be able to access which data also come into play. This reference architecture can accommodate all of these needs. To learn more, walk through the deployment steps in the official Overwatch documentation.
How to perform a multi-Workspace deployment
- Download the CSV file.
- Fill in the CSV with the details of the Workspaces you want to monitor. Refer to the column descriptions to learn more about each column.
- Add the dependent library.
- Run it via a notebook (example here) or as a JAR; a sketch of a notebook run follows below.
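In a Scala notebook, the final step typically looks something like the sketch below. The entry point (MultiWorkspaceDeployment) and its arguments may differ between Overwatch releases, and the paths are placeholders, so treat this as an outline and follow the official runner notebook for the exact call:

```scala
// Sketch only: paths are placeholders and the exact deployment API may vary
// by Overwatch version -- confirm against the official runner notebook.
import com.databricks.labs.overwatch.MultiWorkspaceDeployment

val configCsvPath = "dbfs:/FileStore/overwatch/workspace_configs.csv" // your filled-in CSV
val tempLocation  = "dbfs:/tmp/overwatch/templocation"                // scratch location
val parallelism   = 4                                                 // Workspaces processed in parallel

// Deploy Overwatch across every Workspace listed in the CSV.
MultiWorkspaceDeployment(configCsvPath, tempLocation)
  .deploy(parallelism, "Bronze,Silver,Gold")
```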
For more details and instructions, please visit the official Overwatch documentation. You can raise an issue directly via this link.