Databricks Unity Catalog revolutionizes data governance for the Data Intelligence Platform, providing enterprises with AI-powered tools to seamlessly govern their structured and unstructured data, ML models, notebooks, dashboards, and files on any cloud or platform. Integral to this are System Tables, centrally stored and governed by Unity Catalog on Delta Lake, which act as the foundation for Data Intelligence Platform observability, facilitating operational intelligence at scale for critical business inquiries. In this blog post, we provide a list of queries you can use to answer the top 10 most important questions about the status and health of your Data Intelligence Platform.
Observability in the Data Intelligence Platform
We've always strived to provide transparent metrics within Databricks so that users can monitor what they need to. Over time, as we added more options for different parts of the product, we ended up with a heterogeneous landscape of tools supporting observability. This wasn't ideal and led to some unintended complexity.
For example, consider cluster monitoring and a Data Engineer who needs to monitor the compute resources available in the platform. Until 2022, they had to use Ganglia, a tool that exposed historical CPU usage only as PNG images of charts. Other data was available in either JSON or CSV format, usually dumped to a storage account of the cloud service provider. None of this provided a streamlined user experience.
With the addition of Unity Catalog, we saw an opportunity to improve on this and make all of the monitoring and observability data available in a consistent and simple way: System Tables.
With System Tables, all of your monitoring/observability data are defined as Delta tables in Unity Catalog, with pre-defined and well-documented schemas (AWS, Azure, GCP). Anyone can explore and aggregate the data directly in Databricks using a Notebook or SQL Editor to extract useful insights. Moreover, dashboards can easily be created for business users, with Lakeview Dashboards or directly in your BI tool of choice, e.g., PowerBI.
For example, this blog post contains a PowerBI template for cost analysis.
System Tables in Unity Catalog
System Tables serve as an analytical repository for your account's operational data, available under the system catalog in Unity Catalog. They provide a means for historical observability across your account, enabling retrospective analysis of operational activities. There are many tables available to monitor different aspects of your Data Intelligence Platform: costs, auditing and access, compute, usage, and optimizations, and the list grows every quarter with new releases. All of them are organized in schemas as shown in the image below.
Top 10 System Tables Queries
Now that the data is made available in an easy-to-understand way, many observability opportunities open up for our customers. In this blog post we highlight what we believe are 10 important questions you can answer with System Tables. Below you'll find the list of questions, divided into 3 categories, with the corresponding queries used to answer them, so that you can start using System Tables in your Data Intelligence Platform immediately.
Resource monitoring with system tables provides deep insights into the operation of your Databricks workspaces - from analyzing and optimizing compute to finding poor-performing queries.
The examples below focus primarily on resource monitoring queries for DBSQL Warehouses. Similar queries can be applied to other compute types as well. Note that some of the tables (such as the query table) were in public preview at the time of writing, but should become available to activate shortly.
This query helps you understand which warehouses are currently active and their running time in hours.
It is particularly useful in bigger organizations, where admins often struggle to understand usage patterns and keep resource usage efficient. This query gives them an overview of current usage and thus supports making informed decisions on cluster policies.
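A minimal sketch of such a query, assuming the documented system.compute.warehouse_events schema (warehouse_id, workspace_id, event_type, cluster_count, event_time, with event types such as RUNNING, STOPPED, SCALED_UP), could look like this:

```sql
-- Warehouses whose latest event is not a stop event, with hours since their last start.
WITH latest AS (
  SELECT
    warehouse_id,
    workspace_id,
    event_type,
    event_time,
    ROW_NUMBER() OVER (PARTITION BY warehouse_id ORDER BY event_time DESC) AS rn
  FROM system.compute.warehouse_events
),
last_start AS (
  SELECT
    warehouse_id,
    MAX(event_time) AS started_at
  FROM system.compute.warehouse_events
  WHERE event_type = 'RUNNING'
  GROUP BY warehouse_id
)
SELECT
  l.warehouse_id,
  l.workspace_id,
  s.started_at,
  TIMESTAMPDIFF(MINUTE, s.started_at, CURRENT_TIMESTAMP()) / 60.0 AS running_hours
FROM latest l
JOIN last_start s
  ON l.warehouse_id = s.warehouse_id
WHERE l.rn = 1
  AND l.event_type NOT IN ('STOPPED', 'STOPPING')
ORDER BY running_hours DESC;
```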
Alerting Opportunity: As a DBSQL Admin, you might want to be alerted if a warehouse is running longer than expected. You can use the query results to set an alert condition, such as triggering an alert when the running hours exceed a certain threshold. The admin can then check whether the warehouse is actually being used and, if not, turn it off manually. Whenever a new user runs a query, a Serverless Warehouse will start again in a matter of seconds.
This query identifies warehouses that have scaled up (increased in cluster count) and have remained in that state longer than usual.
It allows admins to identify suboptimal usage patterns and take action, e.g., reducing the maximum cluster count.
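As a sketch, again assuming the system.compute.warehouse_events schema, the query below flags warehouses whose most recent scaling event is SCALED_UP and reports how long they have stayed in that state (the one-cluster threshold is an illustrative assumption):

```sql
-- Warehouses currently in an upscaled state (latest scaling event is SCALED_UP
-- with more than one cluster) and the hours spent in that state so far.
WITH scaling AS (
  SELECT
    warehouse_id,
    event_type,
    cluster_count,
    event_time,
    ROW_NUMBER() OVER (PARTITION BY warehouse_id ORDER BY event_time DESC) AS rn
  FROM system.compute.warehouse_events
  WHERE event_type IN ('SCALED_UP', 'SCALED_DOWN')
)
SELECT
  warehouse_id,
  cluster_count,
  event_time AS upscaled_since,
  TIMESTAMPDIFF(MINUTE, event_time, CURRENT_TIMESTAMP()) / 60.0 AS upscaled_hours
FROM scaling
WHERE rn = 1
  AND event_type = 'SCALED_UP'
  AND cluster_count > 1
ORDER BY upscaled_hours DESC;
```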
Alerting Opportunity: Alerting on this condition can help you monitor resources and cost. You could set an alert when the upscaled hours exceed a certain limit.
This query looks for disk spill, which is one of the best indicators that a warehouse needs a bigger t-shirt size. If a warehouse spills to disk often, its high-speed memory is full and it has to write to disk, which is slower and less efficient. Increasing the machine size in these cases leads to better query performance and consequently a better user experience, while decreasing cluster run time.
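A sketch of such a query, assuming the preview query history system table (referred to here as system.query.history) exposes a spilled_local_bytes column and a compute struct containing warehouse_id:

```sql
-- Statements from the last 30 days that spilled to local disk, largest spill first.
SELECT
  statement_id,
  executed_by,
  compute.warehouse_id               AS warehouse_id,
  spilled_local_bytes / POW(1024, 3) AS spilled_gb,
  total_duration_ms / 1000.0         AS duration_s,
  start_time
FROM system.query.history
WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
  AND spilled_local_bytes > 0
ORDER BY spilled_local_bytes DESC
LIMIT 100;
```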
Queries with large amounts of shuffle are a good place to look for inefficiently written queries or suboptimal table structure. Once users know which queries are inefficient, they can modify and improve them. Efficient queries mean less compute time or the possibility to decrease the t-shirt size of a warehouse, resulting in lower costs or better performance for end users.
The query below shows the potentially most inefficient queries from the last 30 days (based on shuffle).
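Under the same assumption about the preview query history table, a sketch ranked by shuffle_read_bytes:

```sql
-- The 10 statements with the most shuffle over the last 30 days.
SELECT
  statement_id,
  executed_by,
  compute.warehouse_id              AS warehouse_id,
  shuffle_read_bytes / POW(1024, 3) AS shuffle_read_gb,
  read_bytes / POW(1024, 3)         AS read_gb,
  total_duration_ms / 1000.0        AS duration_s
FROM system.query.history
WHERE start_time >= CURRENT_TIMESTAMP() - INTERVAL 30 DAYS
  AND shuffle_read_bytes > 0
ORDER BY shuffle_read_bytes DESC
LIMIT 10;
```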
Monitoring costs is an important aspect of a Data Platform. Customers want to make sure the current spend is on track with respect to estimates and budgets. With System Tables you can achieve this easily and give FinOps teams a complete view of the current spend. You can even give them the freedom to operate independently and do their own analysis by assigning them permissions to access the dedicated tables with a couple of clicks in Unity Catalog.
System Tables that allow you to track your current spend are those available in the billing schema. At the time of writing there are two tables available: usage and list_prices.
What are the most common questions our customers worldwide answer with the billing System Tables?
This query identifies the consumption generated, in DBUs and their dollar equivalent ($DBU) at list price, broken down by SKU on a daily basis.
It lets admins understand how users use the platform. If a new DWH migration went live last month, they should expect a spike in SQL or ETL DBUs, while an unexpected spike in machine learning costs might trigger an investigation.
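A sketch, assuming the documented system.billing.usage and system.billing.list_prices schemas (joining on SKU, cloud, usage unit, and the price validity window):

```sql
-- Daily DBU consumption per SKU, with the list-price dollar equivalent.
SELECT
  u.usage_date,
  u.sku_name,
  SUM(u.usage_quantity)                      AS dbus,
  SUM(u.usage_quantity * lp.pricing.default) AS list_price_dollars
FROM system.billing.usage u
JOIN system.billing.list_prices lp
  ON  u.sku_name   = lp.sku_name
  AND u.cloud      = lp.cloud
  AND u.usage_unit = lp.usage_unit
  AND u.usage_end_time >= lp.price_start_time
  AND (lp.price_end_time IS NULL OR u.usage_end_time < lp.price_end_time)
GROUP BY u.usage_date, u.sku_name
ORDER BY u.usage_date DESC, dbus DESC;
```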
This query identifies the consumption generated, in DBUs and $DBU at list price, by every workspace on a monthly basis.
It is a finer-grained version of the previous one, with aggregations by workspace.
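A sketch of the per-workspace variant, under the same schema assumptions:

```sql
-- Monthly DBU consumption and list-price dollar equivalent per workspace.
SELECT
  DATE_TRUNC('MONTH', u.usage_date)          AS usage_month,
  u.workspace_id,
  SUM(u.usage_quantity)                      AS dbus,
  SUM(u.usage_quantity * lp.pricing.default) AS list_price_dollars
FROM system.billing.usage u
JOIN system.billing.list_prices lp
  ON  u.sku_name   = lp.sku_name
  AND u.cloud      = lp.cloud
  AND u.usage_unit = lp.usage_unit
  AND u.usage_end_time >= lp.price_start_time
  AND (lp.price_end_time IS NULL OR u.usage_end_time < lp.price_end_time)
GROUP BY DATE_TRUNC('MONTH', u.usage_date), u.workspace_id
ORDER BY usage_month DESC, dbus DESC;
```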
This query identifies the monthly consumption generated in DBUs and $DBUs by a specific Databricks Workflow.
Admins might engage Data Engineers to look at particularly heavy pipelines and optimize them.
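A sketch that filters on the job ID recorded in usage_metadata; the job ID shown is a placeholder:

```sql
-- Monthly DBUs and list-price dollars attributed to one Databricks Workflow (job).
SELECT
  DATE_TRUNC('MONTH', u.usage_date)          AS usage_month,
  u.usage_metadata.job_id                    AS job_id,
  SUM(u.usage_quantity)                      AS dbus,
  SUM(u.usage_quantity * lp.pricing.default) AS list_price_dollars
FROM system.billing.usage u
JOIN system.billing.list_prices lp
  ON  u.sku_name   = lp.sku_name
  AND u.cloud      = lp.cloud
  AND u.usage_unit = lp.usage_unit
  AND u.usage_end_time >= lp.price_start_time
  AND (lp.price_end_time IS NULL OR u.usage_end_time < lp.price_end_time)
WHERE u.usage_metadata.job_id = '123456789'  -- placeholder job ID
GROUP BY DATE_TRUNC('MONTH', u.usage_date), u.usage_metadata.job_id
ORDER BY usage_month DESC;
```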
Auditing is pivotal in any information system, and this is especially true for Data Platforms that handle sensitive business information, for reasons of compliance, security, accountability, and incident response, among others. Unity Catalog facilitates compliance with regulatory standards by providing an exhaustive list of operations that moved, transformed, or exposed your data via the audit System Table, available under the access schema of the system catalog.
What are the most common questions our customers worldwide answer with the access System Tables?
This query identifies the top 10 tables most accessed by the user email@databricks.com. Admins might reach out to such users to help them optimize their queries if needed.
Access to System Tables can be restricted using native Unity Catalog permissions: Metastore admins can limit access to a specific set of users. Alternatively, Views can be created on top to grant dynamic access based on the role of the querying user. This is useful in situations such as this one, where PII, i.e., the email address, is part of the results and not everyone in the company might be allowed to see it. For example, if an admin runs the query, the email address is shown in the clear; otherwise a hash is returned or the column is hidden completely.
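As an illustration of this pattern, a dynamic view along the following lines could hash the email for everyone outside an admin group; the target catalog and schema (main.observability) and the group name (platform_admins) are placeholders:

```sql
-- Hypothetical masking view: members of the admin group see the email in the clear,
-- everyone else sees a SHA-256 hash of it.
CREATE OR REPLACE VIEW main.observability.table_access_masked AS
SELECT
  CASE
    WHEN is_account_group_member('platform_admins') THEN created_by
    ELSE sha2(created_by, 256)
  END AS created_by,
  source_table_full_name,
  event_time
FROM system.access.table_lineage;
```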
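One way to sketch this uses the table_lineage table (also in the access schema) rather than the raw audit log, counting read (source) records attributed to the user; the email is a placeholder:

```sql
-- Top 10 tables read by a given user, based on lineage records.
SELECT
  source_table_full_name,
  COUNT(*) AS access_count
FROM system.access.table_lineage
WHERE created_by = 'email@databricks.com'
  AND source_table_full_name IS NOT NULL
GROUP BY source_table_full_name
ORDER BY access_count DESC
LIMIT 10;
```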
This query identifies the top 10 users accessing the table catalog.schema.table.
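A sketch using the same lineage table, turned around to group by user; the table name is a placeholder:

```sql
-- Top 10 users reading a specific table, based on lineage records.
SELECT
  created_by AS user_email,
  COUNT(*)   AS access_count
FROM system.access.table_lineage
WHERE source_table_full_name = 'catalog.schema.table'
GROUP BY created_by
ORDER BY access_count DESC
LIMIT 10;
```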
This query identifies which tables the user email@databricks.com accessed in the last 24 hours.
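A sketch restricted to the last 24 hours, again via the lineage table and with a placeholder email:

```sql
-- Tables a given user read in the last 24 hours, with the most recent access time.
SELECT
  source_table_full_name,
  MAX(event_time) AS last_accessed
FROM system.access.table_lineage
WHERE created_by = 'email@databricks.com'
  AND event_time >= CURRENT_TIMESTAMP() - INTERVAL 24 HOURS
  AND source_table_full_name IS NOT NULL
GROUP BY source_table_full_name
ORDER BY last_accessed DESC;
```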
Conclusion
Databricks Unity Catalog significantly enhances data governance within the Data Intelligence Platform by offering AI-powered tools for managing structured and unstructured data, as well as machine learning models, notebooks, dashboards, and files across any cloud or platform. Central to this capability are the System Tables, stored and governed on Delta Lake, which provide a robust foundation for platform observability and enable scalable operational intelligence. With queries like the ones above, organizations can effectively monitor and analyze their Data Intelligence Platform, answer critical questions about its status and health, and support strategic decision-making.