What is Overwatch?
Overwatch is an observability tool that helps you monitor your cloud spending and track usage across various dimensions. It works by collecting job and audit log data, then joining it with data from the Databricks REST API and other sources available in the platform. This data is processed into a set of tables that describe the ongoing activity of your Databricks Workspace(s).
Overwatch is maintained as part of Databricks Labs and supports all the major clouds: Azure, AWS, and GCP. In this post we will look at a variety of analytics made possible by Overwatch, then discuss what a multi-Workspace deployment is and how to implement it!
Features of Overwatch
Monitoring Workspaces
1. Total spend
![SriramMohanty_0-1701149209768.png SriramMohanty_0-1701149209768.png](/t5/image/serverpage/image-id/5354i149BA7844ECE08A1/image-size/large?v=v2&px=999)
2. Cluster count by type
![SriramMohanty_2-1701149336228.png SriramMohanty_2-1701149336228.png](/t5/image/serverpage/image-id/5356i40C474DBBB1481B9/image-size/large?v=v2&px=999)
3. DBU cost vs compute cost
![SriramMohanty_3-1701149813391.png SriramMohanty_3-1701149813391.png](/t5/image/serverpage/image-id/5357iFAC742865D07394E/image-size/large?v=v2&px=999)
Monitoring Clusters
1. Most expensive clusters by day
![SriramMohanty_0-1701183538445.png SriramMohanty_0-1701183538445.png](/t5/image/serverpage/image-id/5362iD9646AA28E04BFC3/image-size/large?v=v2&px=999)
2. DBU spend by cluster type
![SriramMohanty_1-1701183574382.png SriramMohanty_1-1701183574382.png](/t5/image/serverpage/image-id/5363iB8B6263781E600E0/image-size/large?v=v2&px=999)
3. Cluster node types
![SriramMohanty_2-1701183606944.png SriramMohanty_2-1701183606944.png](/t5/image/serverpage/image-id/5364iD6047AD5591A1D19/image-size/medium?v=v2&px=400)
4. Percentage of auto-scaling clusters
![SriramMohanty_3-1701183629800.png SriramMohanty_3-1701183629800.png](/t5/image/serverpage/image-id/5365iB4DB7D5B5C97E100/image-size/large?v=v2&px=999)
5. Scale up time of clusters without pools
![SriramMohanty_4-1701183781343.png SriramMohanty_4-1701183781343.png](/t5/image/serverpage/image-id/5366iF06E69C00133FF92/image-size/large?v=v2&px=999)
6. Cluster failure state and count of failures
![SriramMohanty_5-1701183812941.png SriramMohanty_5-1701183812941.png](/t5/image/serverpage/image-id/5367i708DC51751C82E84/image-size/large?v=v2&px=999)
Monitoring Jobs
1. DBUs by Workflow by Workspace by date
![SriramMohanty_6-1701184000914.png SriramMohanty_6-1701184000914.png](/t5/image/serverpage/image-id/5368i703BE51A329446F9/image-size/large?v=v2&px=999)
2. Jobs running on interactive clusters
![SriramMohanty_7-1701184033554.png SriramMohanty_7-1701184033554.png](/t5/image/serverpage/image-id/5369i71FC8A03CD38A017/image-size/large?v=v2&px=999)
3. Daily job status distribution
![SriramMohanty_8-1701184057763.png SriramMohanty_8-1701184057763.png](/t5/image/serverpage/image-id/5370i4AE6499505DA0031/image-size/large?v=v2&px=999)
4. Impact of failure by Workspace
![SriramMohanty_9-1701184097575.png SriramMohanty_9-1701184097575.png](/t5/image/serverpage/image-id/5371iE69EB3DEFA51FCA2/image-size/large?v=v2&px=999)
Here are some other analyses you can perform with Overwatch:
- Last 30 days spend
Aggregate cost of cluster spend in all workspaces for the last 30 days.
- Month-over-month change in spend
Percentage change in cluster spend compared with the previous month. A value below zero means spend is down from the previous month; a value above zero means it is up.
- Top 3 cluster spend by workspace in the last 30 days
The three clusters with the highest spend in each Workspace.
- Week-over-week top 10 fastest growing clusters by Workspace
The 10 clusters whose spend grew fastest compared with the previous week. A value below zero means the cluster's spend fell relative to the previous week; a value above zero means it grew.
- Last 7 days of spend by Databricks Workflow
Expenses for each job in the last 7 days.
- Last 7 days of spend for Databricks Workflows executed on interactive clusters
Expenses for jobs run on interactive clusters in the last 7 days.
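To make two of these analyses concrete, here is a minimal sketch in plain Python of the arithmetic behind "month-over-month change in spend" and "top 3 cluster spend by workspace". It does not use Overwatch tables or APIs; the function names, the `(workspace, cluster, spend)` row shape, and the sample values are all hypothetical and exist only for illustration. In practice you would run the equivalent aggregation as a query against the tables Overwatch produces.

```python
from collections import defaultdict


def pct_change(previous, current):
    """Percentage change from previous to current; None when previous is 0."""
    if previous == 0:
        return None
    return (current - previous) * 100.0 / previous


def top_clusters_by_workspace(rows, n=3):
    """Return {workspace: [(cluster, total_spend), ...]} with the n costliest clusters."""
    # Sum spend per (workspace, cluster) pair.
    totals = defaultdict(float)
    for workspace, cluster, spend in rows:
        totals[(workspace, cluster)] += spend

    # Group the per-cluster totals under their workspace.
    per_workspace = defaultdict(list)
    for (workspace, cluster), total in totals.items():
        per_workspace[workspace].append((cluster, total))

    # Keep only the n highest totals in each workspace.
    return {
        ws: sorted(clusters, key=lambda c: c[1], reverse=True)[:n]
        for ws, clusters in per_workspace.items()
    }


# Hypothetical sample data: (workspace, cluster, spend in USD).
sample_rows = [
    ("ws-a", "etl", 120.0),
    ("ws-a", "etl", 80.0),
    ("ws-a", "adhoc", 50.0),
    ("ws-a", "ml", 300.0),
    ("ws-a", "bi", 10.0),
    ("ws-b", "etl", 40.0),
]

month_over_month = pct_change(previous=1000.0, current=850.0)
top3 = top_clusters_by_workspace(sample_rows, n=3)
```

With these sample numbers, `month_over_month` is negative because spend fell from 1000 to 850, which matches the "below zero means usage is down" reading above.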
What is a multi-Workspace deployment of Overwatch?
If you have multiple Databricks Workspaces and want to monitor them together, you can use a multi-Workspace deployment. Instead of setting up monitoring jobs in each Workspace separately, which quickly becomes daunting if you have, say, 100 of them, a single job in one Workspace collects data from all the Workspaces you specify and writes it into a centralized database in your Lakehouse. You can then query that data from any Workspace you choose.
Architecture of a multi-Workspace Deployment:
Overwatch can be deployed on a single, primary Workspace and then retrieve data from all other Databricks Workspaces. For more details on requirements, see Multi-Workspace Consideration. In many cases, some Workspaces should be able to monitor many Workspaces while others should monitor only themselves. Where the output data is co-located, and who should be able to access which data, also come into play. This reference architecture can accommodate all of these needs. To learn more, walk through the deployment steps in the official Overwatch documentation.
![SriramMohanty_11-1701184716367.png SriramMohanty_11-1701184716367.png](/t5/image/serverpage/image-id/5373i67E4D6D7D1C969B1/image-size/large?v=v2&px=999)
How to perform a multi-Workspace deployment
- Download the CSV file.
- Fill in the CSV with the details of the Workspaces you want to monitor. Refer to the column descriptions to learn more about each column.
- Add the dependent library.
- Run it via Notebook (example here), or run it as a JAR.
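Before running the deployment, it can help to sanity-check the CSV you filled in. The following is a minimal sketch, not part of Overwatch: it only checks that a set of columns you name is present and non-empty in every row. The column names `workspace_id` and `workspace_url` and the sample values are hypothetical placeholders; take the actual required columns from the column descriptions in the Overwatch documentation.

```python
import csv
import io


def find_config_problems(csv_text, required_columns):
    """Return a list of human-readable problems found in the config CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []

    # Stop early if any required column is absent from the header.
    missing = [c for c in required_columns if c not in header]
    if missing:
        return [f"missing column: {c}" for c in missing]

    problems = []
    # The header is row 1, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        for column in required_columns:
            if not (row.get(column) or "").strip():
                problems.append(f"row {row_no}: empty value for {column}")
    return problems


# Hypothetical column names and values, for illustration only.
REQUIRED = ["workspace_id", "workspace_url"]
sample_csv = (
    "workspace_id,workspace_url\n"
    "111,https://example-one.invalid\n"
    "222,\n"
)

problems = find_config_problems(sample_csv, REQUIRED)
```

Here `problems` flags the second data row, which has no URL. An empty list would mean every named column is present and filled in; it says nothing about whether the values themselves are valid.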
For more details and instructions, please visit the official Overwatch site. You can also raise an issue directly at this link.