Databricks Data Warehousing Announcements - July 2024
Predictive Optimisation
Predictive Optimisation is now GA. It uses AI to understand the maintenance operations your Unity Catalog tables need (e.g. based on data access patterns) and automatically runs optimisations on your data layouts to improve query performance. This removes the manual overhead of scheduling optimisation jobs and deciding on their frequency and type: tables are managed automatically.
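As a minimal sketch of how this might be switched on, assuming a Unity Catalog metastore and hypothetical catalog and schema names, Predictive Optimisation can be enabled at the catalog or schema level in SQL:

```sql
-- Enable Predictive Optimisation for all managed tables in a catalog
-- ("main" and "sales" below are hypothetical names)
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

-- Or scope it to a single schema, overriding the catalog-level setting
ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;

-- A schema can also fall back to its parent catalog's setting
ALTER SCHEMA main.sales INHERIT PREDICTIVE OPTIMIZATION;
```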
Cost Management Dashboards
This is in Public Preview. Account admins can now import dashboards to monitor costs at either the account level or the workspace level. Use the dashboards to view the metrics below, with the option to fully customise them (a sketch query against the underlying billing system table follows the list):
- Usage breakdown by SKU name
- Usage analysis based on custom tags
- Usage analysis on the most expensive usage
- Usage breakdown by billing origin product
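For reference, the same kind of breakdown can be pulled directly from the billing system table these dashboards sit on top of. A minimal sketch, assuming the public preview schema of system.billing.usage (column names should be verified in your workspace):

```sql
-- Usage (in DBUs or other usage units) by SKU and originating product,
-- for the last 30 days
SELECT
  sku_name,
  billing_origin_product,
  usage_unit,
  SUM(usage_quantity) AS total_usage
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 30)
GROUP BY sku_name, billing_origin_product, usage_unit
ORDER BY total_usage DESC;
```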
System Table updates
There are various updates to system tables, Databricks' store of operational data for observability:
- Databricks Assistant system tables in public preview: Track the usage of Databricks Assistant through the system.access.assistant_events table, which records the workspace, the event time, and the email of the user who initiated a message to the Assistant.
- Node timeline system tables in public preview: The node timeline table provides node-level utilisation at one-minute granularity. Monitor metrics such as node type, CPU and memory utilisation, and network traffic sent in bytes.
- Query history system tables in public preview: The system.query.history table records every SQL statement run via SQL warehouses, with metrics such as the SQL statement text, warehouse ID, execution duration, and bytes read (see the sketch query after this list).
- Billing system tables are enabled by default in all Unity Catalog workspaces. Billing tables give you an overview of usage by SKU, duration, and so on.
- Workflows system tables in public preview: There are four tables in the system.workflow schema, which allow you to monitor:
- jobs: tracks creation, deletion & basic information of all jobs
- job_tasks: tracks creation, deletion & basic information of all job tasks
- jobs_run_timeline: records the start, end and resulting state of job runs
- job_task_run_timeline: records the start, end, and resulting state of job tasks
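As a sketch of how these tables can be queried, the example below pulls the slowest recent statements from the query history table. It assumes the column names described in the public preview docs (statement_text, total_duration_ms, read_bytes, start_time); verify them with DESCRIBE TABLE in your workspace:

```sql
-- Longest-running SQL warehouse statements over the last 7 days
SELECT
  executed_by,
  statement_text,
  total_duration_ms / 1000 AS duration_seconds,
  read_bytes
FROM system.query.history
WHERE start_time >= date_sub(current_date(), 7)
ORDER BY total_duration_ms DESC
LIMIT 20;
```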
Primary Key and Foreign Key constraints are GA and now enable faster queries
Primary key (PK) and foreign key (FK) constraints can be defined on Unity Catalog tables for data modeling purposes. You can define them during table creation or add them afterwards with an ALTER TABLE statement. Do note that primary and foreign key constraints are currently informational and not enforced. They are mainly used to document data integrity relationships, and they also give end users the ability to view the constraints in Unity Catalog via an entity relationship diagram (ERD).
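A minimal sketch of the DDL, using hypothetical tables in a hypothetical main.sales schema:

```sql
-- Primary key declared inline at table creation
CREATE TABLE main.sales.customers (
  customer_id BIGINT NOT NULL CONSTRAINT customers_pk PRIMARY KEY,
  name        STRING
);

-- Foreign key declared as a table-level constraint
CREATE TABLE main.sales.orders (
  order_id    BIGINT NOT NULL,
  customer_id BIGINT,
  CONSTRAINT orders_customers_fk FOREIGN KEY (customer_id)
    REFERENCES main.sales.customers (customer_id)
);

-- Constraints can also be added to an existing table
ALTER TABLE main.sales.orders
  ADD CONSTRAINT orders_pk PRIMARY KEY (order_id);
```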
For valid primary keys, adding the RELY option lets Databricks use the constraint for optimisation: the query optimizer factors the declared primary key's data integrity into query plans.
One optimization RELY enables is eliminating unnecessary aggregates based on the primary key constraint. For example, if a DISTINCT operation is run over a column that is a primary key declared with RELY, the redundant DISTINCT is removed, which speeds up the query by 2x.
Another optimization from RELY is removing unnecessary joins. If a query joins to a table that is only referenced in the join condition, the primary key constraint indicates that the join produces at most one matching row, which helps the query optimizer identify cases where it can eliminate the join from the query entirely. In the blog example, this optimization sped the query up from 1.5 minutes to 6 seconds!
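As a sketch of the kind of statements that benefit, continuing the hypothetical main.sales tables above and assuming the customers primary key was declared with the RELY option:

```sql
-- Assumes the primary key was declared with RELY, e.g. at CREATE TABLE time:
--   customer_id BIGINT NOT NULL CONSTRAINT customers_pk PRIMARY KEY RELY

-- Aggregate elimination: customer_id is a primary key declared with RELY,
-- so the DISTINCT cannot remove any rows and the optimizer can drop it
SELECT DISTINCT customer_id
FROM main.sales.customers;

-- Join elimination: customers is only referenced in the join condition and
-- the RELY primary key guarantees at most one match per order, so the
-- optimizer can drop the join entirely
SELECT o.order_id, o.customer_id
FROM main.sales.orders AS o
LEFT JOIN main.sales.customers AS c
  ON o.customer_id = c.customer_id;
```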
Read the full blog post here!
