mattiazeni, Databricks Employee


Databricks Unity Catalog revolutionizes data governance for the Data Intelligence Platform, providing enterprises with AI-powered tools to seamlessly govern their structured and unstructured data, ML models, notebooks, dashboards, and files on any cloud or platform. Integral to this are System Tables, centrally stored and governed by Unity Catalog on Delta Lake, which act as the foundation for Data Intelligence Platform observability, facilitating operational intelligence at scale for critical business inquiries. In this post, we provide a list of queries that answer the ten most important questions about the status and health of your Data Intelligence Platform.

Observability in the Data Intelligence Platform

We've always strived to provide transparent metrics within Databricks so that users can monitor what they need to. Over time we added more and more options for different parts of the product, which created a heterogeneous landscape of observability tools. This wasn't ideal and led to some unintended complexity.

For example, consider cluster monitoring and a Data Engineer who needs to monitor the compute resources available in the platform. Until 2022, they had to use Ganglia, a tool that made historical CPU usage data available only as PNG images of charts. Other data was available in either JSON or CSV format, usually dumped to a storage account of the cloud service provider. None of this provided a streamlined experience for our users.


With the addition of Unity Catalog, we saw an opportunity to improve on this and make all of the monitoring and observability data available in a consistent and simple way: System Tables.

With System Tables, all of your monitoring and observability data is defined as Delta tables in Unity Catalog, with pre-defined and well-documented schemas (AWS, Azure, GCP). Anyone can explore and aggregate the data directly in Databricks using a Notebook or the SQL Editor to extract useful insights. Moreover, dashboards can easily be created for business users, with Lakeview Dashboards or directly in your BI tool of choice, e.g., Power BI.
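
As a minimal illustration, the following can be run as-is from the SQL Editor to start exploring billing data (assuming System Tables have been enabled in your account):

-- Preview the billing usage table; its schema is documented for each cloud
SELECT * FROM system.billing.usage LIMIT 10;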

For example, this blog post contains a Power BI template for cost analysis.



System Tables in Unity Catalog

System Tables serve as an analytical repository for your account's operational data, and they are available under the system catalog in Unity Catalog. They provide a means for historical observability across your account, enabling retrospective analysis of operational activities. There are a multitude of tables available to monitor different aspects of your Data Intelligence Platform: costs, auditing and access, compute, usage, optimizations, and the list is growing every quarter with new releases. All of them are organized in schemas as shown in the image below.

[Image: System Tables organized in schemas under the system catalog]
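
To see which schemas and tables are currently exposed in your account, you can also enumerate them directly (availability varies by cloud and enablement status):

-- List the schemas available in the system catalog
SHOW SCHEMAS IN system;

-- List the tables in one of those schemas
SHOW TABLES IN system.billing;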

Top 10 System Tables Queries

Now that the data is available in an easy-to-understand way, a lot of observability opportunities open up for our customers. In this blog post we highlight what we believe are the 10 most important questions you can answer with System Tables. Below you'll find the questions, divided into three categories, with the corresponding queries used to answer them, so that you can start using System Tables in your Data Intelligence Platform immediately.

Resource Monitoring 

Resource monitoring with system tables provides deep insights into the operation of your Databricks workspaces - from analyzing and optimizing compute to finding poor-performing queries.

The examples below focus primarily on resource monitoring queries for Databricks SQL (DBSQL) warehouses. Similar queries can be applied to other types of compute as well. Note that some of the tables (such as the query table) were in public preview at the time of writing, but should be available to activate shortly.

Query 1: Resource Monitoring - DB SQL Warehouses that are actively running and for how long

This query helps you understand which warehouses are currently active and for how many hours they have been running.

It is especially useful in bigger organizations, where admins struggle to understand usage patterns and find it hard to keep resource usage efficient. This query gives them an overview of current usage and thus supports good decisions on cluster policies.

SELECT
    we.workspace_id,
    we.warehouse_id,
    we.event_time,
    TIMESTAMPDIFF(MINUTE, we.event_time, CURRENT_TIMESTAMP()) / 60.0 AS running_hours,
    we.cluster_count
FROM
    system.compute.warehouse_events we
WHERE
    we.event_type = 'RUNNING'
    -- keep only the most recent event per warehouse: no later event exists
    AND NOT EXISTS (
        SELECT 1
        FROM system.compute.warehouse_events we2
        WHERE we2.warehouse_id = we.warehouse_id
        AND we2.event_time > we.event_time
    )

A sample output is shown below.

[Image: sample query output]

Alerting Opportunity: As a DBSQL Admin, you might want to be alerted if a warehouse is running longer than expected. You can use the query results to set an alert condition, such as triggering an alert when the running hours exceed a certain threshold. The admin can then check whether the warehouse is actually in use and, if not, turn it off manually. Whenever a new user runs a query, a Serverless Warehouse will start in a matter of seconds.
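
As a sketch, such an alert can be driven by a variant of Query 1 that only returns rows beyond a threshold; the 8-hour cutoff below is purely illustrative:

SELECT
    we.workspace_id,
    we.warehouse_id,
    TIMESTAMPDIFF(MINUTE, we.event_time, CURRENT_TIMESTAMP()) / 60.0 AS running_hours
FROM system.compute.warehouse_events we
WHERE we.event_type = 'RUNNING'
    AND NOT EXISTS (
        SELECT 1
        FROM system.compute.warehouse_events we2
        WHERE we2.warehouse_id = we.warehouse_id
        AND we2.event_time > we.event_time
    )
    -- illustrative threshold: only warehouses running for more than 8 hours
    AND TIMESTAMPDIFF(MINUTE, we.event_time, CURRENT_TIMESTAMP()) / 60.0 > 8

An alert on this query can then fire whenever it returns at least one row.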

Query 2: Resource Monitoring - Warehouses that are upscaled longer than expected

This query identifies warehouses that have scaled up (increased in cluster count) and have remained in that state longer than usual.

It allows admins to identify suboptimal usage patterns and take action, e.g., reduce the maximum cluster count.

SELECT
   we.workspace_id,
   we.warehouse_id,
   we.event_time,
   TIMESTAMPDIFF(MINUTE, we.event_time, CURRENT_TIMESTAMP()) / 60.0 AS upscaled_hours,
   we.cluster_count
FROM
   system.compute.warehouse_events we
WHERE
   we.event_type = 'SCALED_UP'
   AND we.cluster_count >= 2
   -- exclude scale-ups that were followed by a scale-down event
   AND NOT EXISTS (
       SELECT 1
       FROM system.compute.warehouse_events we2
       WHERE we2.warehouse_id = we.warehouse_id
       AND (
           (we2.event_type = 'SCALED_DOWN') OR 
           (we2.event_type = 'SCALED_UP' AND we2.cluster_count < 2)
       )
       AND we2.event_time > we.event_time
   )

A sample output is shown below.

[Image: sample query output]

Alerting Opportunity: Alerting on this condition can help you monitor resources and cost. You could set an alert when the upscaled hours exceed a certain limit.
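
A minimal sketch, reusing Query 2 as a CTE with an illustrative 2-hour threshold:

WITH upscaled AS (
    SELECT
        we.workspace_id,
        we.warehouse_id,
        TIMESTAMPDIFF(MINUTE, we.event_time, CURRENT_TIMESTAMP()) / 60.0 AS upscaled_hours
    FROM system.compute.warehouse_events we
    WHERE we.event_type = 'SCALED_UP'
        AND we.cluster_count >= 2
        AND NOT EXISTS (
            SELECT 1
            FROM system.compute.warehouse_events we2
            WHERE we2.warehouse_id = we.warehouse_id
            AND (
                (we2.event_type = 'SCALED_DOWN') OR
                (we2.event_type = 'SCALED_UP' AND we2.cluster_count < 2)
            )
            AND we2.event_time > we.event_time
        )
)
-- illustrative threshold: warehouses upscaled for more than 2 hours
SELECT * FROM upscaled WHERE upscaled_hours > 2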


Query 3: When would increasing t-shirt size for a warehouse lead to better performance?

This query looks for disk spill, which is one of the best indicators of the need to increase the t-shirt size of a warehouse. If you notice a warehouse is spilling to disk often, the higher-speed memory on the warehouse is full and the warehouse has to write to disk, which is slower and less efficient. Increasing the machine size in these cases will lead to better query performance, and consequently a better user experience, while decreasing cluster run time.

SELECT 
    workspace_id, 
    warehouse_id, 
    COUNT(*) as frequency, 
    MAX(spilled_local_bytes) AS spilled_bytes
FROM system.query.history
WHERE start_time between DATE_SUB(CURRENT_DATE, 30) AND CURRENT_DATE 
AND spilled_local_bytes > 0
GROUP BY workspace_id, warehouse_id
ORDER BY frequency DESC;

A sample output is shown below.

[Image: sample query output]

 

Query 4: Suspects for query inefficiency

Queries that have large amounts of shuffle are a good place to look for inefficiently written queries or poorly structured tables. Once users know which queries are inefficient, they can modify and improve them. The benefit of having efficient queries boils down to less compute time, or the possibility to decrease the t-shirt size of a warehouse, resulting in lower costs or better performance for end users.

The query below shows the potentially most inefficient queries from the last 30 days (based on shuffle).

SELECT
    workspace_id,
    warehouse_id,
    statement_id, 
    statement_text,
    SUM(shuffle_read_bytes) AS shuffle_read_bytes
FROM system.query.history
WHERE start_time between DATE_SUB(CURRENT_DATE, 30) AND CURRENT_DATE
GROUP BY workspace_id, warehouse_id, statement_id, statement_text
HAVING shuffle_read_bytes > 0
ORDER BY shuffle_read_bytes DESC;

A sample output is shown below.

[Image: sample query output]

 

Cost Management

Monitoring costs is an important aspect of a Data Platform. Customers want to make sure the current spend is on track with respect to estimates and budgets. With System Tables you can achieve this easily and give FinOps teams a complete view of the current spend. You can even give them the freedom to operate independently and do their own analysis by assigning them permissions to access the dedicated tables with a couple of clicks in Unity Catalog.
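
For example, a minimal sketch of granting read access in SQL (finops_team is a hypothetical account group; the same can be done through the Catalog Explorer UI):

-- finops_team is a hypothetical group: replace with your own account group
GRANT USE CATALOG ON CATALOG system TO finops_team;
GRANT USE SCHEMA ON SCHEMA system.billing TO finops_team;
GRANT SELECT ON TABLE system.billing.usage TO finops_team;
GRANT SELECT ON TABLE system.billing.list_prices TO finops_team;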

System Tables that allow you to track your current spend are those available in the billing schema. At the time of writing there are two tables available: usage and list_prices.

  • usage allows you to view your account's global usage, whichever region your workspaces are in;
  • list_prices gives you access to a historical log of Databricks pricing by SKU. A record gets added each time there is a change to a SKU price.
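
As a quick illustration of the latter, the currently active list price per SKU can be retrieved by filtering for records with no end time:

-- price_end_time IS NULL marks the currently active price for each SKU
SELECT sku_name, pricing.default AS current_list_price
FROM system.billing.list_prices
WHERE price_end_time IS NULL
ORDER BY sku_name;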

What are the most common questions our customers worldwide answer with the billing System Tables?

 

Question 5: What is the daily spend on the different Databricks products?

This query identifies the consumption generated in DBUs and $DBUs at list price differentiated by SKU on a daily basis.

Admins are able to understand how users use their platform. If a new DWH migration went live last month, they should expect a spike in SQL or ETL DBUs, while a spike in machine learning costs might trigger an investigation.

SELECT
    u.usage_date,
    u.sku_name,
    SUM(u.usage_quantity) AS dbus,
    FIRST(lp.pricing.default) AS list_price,
    SUM(u.usage_quantity) * FIRST(lp.pricing.default) AS dollar_dbus_list
FROM system.billing.usage u
INNER JOIN system.billing.list_prices lp ON u.sku_name = lp.sku_name
    AND u.usage_start_time >= lp.price_start_time
    AND (
        u.usage_end_time <= lp.price_end_time
        or lp.price_end_time is null
    )
GROUP BY u.usage_date, u.sku_name
ORDER BY u.usage_date, u.sku_name DESC

Question 6: What is the spend on the different Databricks products per month per workspace?

This query identifies the consumption generated in DBUs and $DBUs at list price on a monthly basis by every workspace.

This query is a finer-grained version of the previous one, with aggregations by workspace.

SELECT
    LEFT(u.usage_date, 7) AS year_month,
    u.workspace_id,
    u.sku_name,
    SUM(u.usage_quantity) AS dbus,
    FIRST(lp.pricing.default) AS list_price,
    SUM(u.usage_quantity) * FIRST(lp.pricing.default) AS dollar_dbus_list
FROM
    system.billing.usage u
INNER JOIN system.billing.list_prices lp ON u.sku_name = lp.sku_name
    AND u.usage_start_time >= lp.price_start_time
    AND ( u.usage_end_time <= lp.price_end_time OR lp.price_end_time is null)
GROUP BY u.workspace_id, LEFT(u.usage_date, 7), u.sku_name
ORDER BY year_month, u.workspace_id DESC

Question 7: How much is the pipeline with job_id 652895398180890 costing per month?

This query identifies the monthly consumption generated in DBUs and $DBUs by a specific Databricks Workflow.

Admins might engage Data Engineers to look at particularly heavy pipelines and optimize them.

SELECT
    LEFT(u.usage_date, 7) AS year_month,
    SUM(u.usage_quantity) AS dbus,
    FIRST(lp.pricing.default) AS list_price,
    SUM(u.usage_quantity) * FIRST(lp.pricing.default) AS dollar_dbus_list
FROM
    system.billing.usage u
INNER JOIN system.billing.list_prices lp ON u.sku_name = lp.sku_name
    AND u.usage_start_time >= lp.price_start_time
    AND (
       u.usage_end_time <= lp.price_end_time OR lp.price_end_time is null
       )
WHERE usage_metadata.job_id = '652895398180890'
GROUP BY LEFT(u.usage_date, 7)
ORDER BY LEFT(u.usage_date, 7) DESC

Auditing

Auditing is pivotal in any information system, and this is especially true for Data Platforms that handle sensitive business information, for reasons including compliance, security, accountability, and incident response. Unity Catalog facilitates compliance with regulatory standards by providing an exhaustive list of the operations that moved, transformed, or exposed your data via the audit System Table, available under the access schema.

What are the most common questions our customers worldwide answer with the access System Tables?

Question 8: What are the top 10 tables accessed by user email@databricks.com?

This query identifies the top 10 tables accessed by user email@databricks.com. Admins might reach out to such users to provide support in optimizing their queries if needed.

Access to System Tables can be restricted using native Unity Catalog permissions. Metastore admins can restrict access to a specific set of users. Alternatively, Views can be created on top to give dynamic access based on the role of the querying user (a sketch of such a view is shown after the query below). This is useful in situations such as this one, where PII data, i.e., the email address, is returned as part of the results, since not everyone in the company might be allowed to see it. For example, if an admin runs this query the email address is shown in the clear; otherwise a hash is returned, or the column is hidden completely.

SELECT
    IFNULL(request_params.full_name_arg, 'Non-specific') AS `TABLE ACCESSED`
FROM system.access.audit
WHERE user_identity.email = 'email@databricks.com'
    AND action_name IN ('commandSubmit','getTable')
GROUP BY IFNULL(request_params.full_name_arg, 'Non-specific')
ORDER BY COUNT(*) DESC
LIMIT 10
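
A minimal sketch of such a dynamic view, assuming an admins account group (the view name and group are hypothetical; is_account_group_member is the Unity Catalog group-membership function):

-- main.observability.audit_masked is a hypothetical target location
CREATE OR REPLACE VIEW main.observability.audit_masked AS
SELECT
    event_time,
    action_name,
    request_params.full_name_arg AS table_accessed,
    CASE
        WHEN is_account_group_member('admins') THEN user_identity.email
        ELSE sha2(user_identity.email, 256) -- non-admins see a hash instead of the address
    END AS user_email
FROM system.access.audit;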

Question 9: Who are the top 10 users who access table catalog.schema.table?

This query identifies the top 10 users accessing table catalog.schema.table.

SELECT
    COUNT(*) AS access_count,
    user_identity.email
FROM
    system.access.audit
WHERE
    action_name IN ('commandSubmit', 'getTable')
    AND request_params.full_name_arg = 'catalog.schema.table'
GROUP BY user_identity.email
ORDER BY COUNT(*) DESC
LIMIT 10

Question 10: What has this user accessed in the last 24 hours?

This query identifies which tables user email@databricks.com accessed in the last 24 hours.

SELECT
    IFNULL(request_params.full_name_arg, 'Non-specific') AS `table_accessed`
FROM system.access.audit
WHERE user_identity.email = 'email@databricks.com'
    AND action_name IN ('commandSubmit','getTable')
    AND event_time >= NOW() - INTERVAL 1 DAY
GROUP BY IFNULL(request_params.full_name_arg, 'Non-specific')
ORDER BY COUNT(*) DESC

Conclusion

Databricks Unity Catalog significantly enhances data governance within the Data Intelligence Platform by offering AI-powered tools for managing structured and unstructured data, as well as machine learning models, notebooks, dashboards, and files, across any cloud or platform. Central to this capability are System Tables, stored and governed on Delta Lake, which provide a robust foundation for platform observability and enable operational intelligence at scale. With the queries above, organizations can effectively monitor and analyze their Data Intelligence Platform, ensuring optimal performance and informed decision-making.
