Databricks Data Warehousing Announcements - July 2024
Predictive Optimisation
Predictive Optimisation is now GA. It uses AI to understand the maintenance operations your Unity Catalog tables need (e.g. based on data access patterns) and automatically runs optimisations on your data layouts to improve query performance. This removes the manual overhead of scheduling optimisation jobs and deciding on their frequency and type: tables are managed automatically.
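As a minimal sketch of how this might be switched on, assuming a Unity Catalog metastore and hypothetical catalog and schema names, Predictive Optimisation can be enabled at the catalog or schema level in SQL:

```sql
-- Enable Predictive Optimisation for all managed tables in a catalog
-- ("main" and "sales" below are hypothetical names)
ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;

-- Or scope it to a single schema, overriding the catalog-level setting
ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;

-- A schema can also fall back to its parent catalog's setting
ALTER SCHEMA main.sales INHERIT PREDICTIVE OPTIMIZATION;
```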
Cost Management Dashboards
This is in Public Preview. Account admins can now import dashboards to monitor costs at either the account level or the workspace level. Use the dashboards to view the metrics below, with the option to fully customise them (a sketch query against the underlying billing system table follows the list):
- Usage breakdown by SKU name
- Usage analysis based on custom tags
- Usage analysis on the most expensive usage
- Usage breakdown by billing origin product
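For reference, the same kind of breakdown can be pulled directly from the billing system table these dashboards sit on top of. A minimal sketch, assuming the public preview schema of system.billing.usage (column names should be verified in your workspace):

```sql
-- Usage (in DBUs or other usage units) by SKU and originating product,
-- for the last 30 days
SELECT
  sku_name,
  billing_origin_product,
  usage_unit,
  SUM(usage_quantity) AS total_usage
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 30)
GROUP BY sku_name, billing_origin_product, usage_unit
ORDER BY total_usage DESC;
```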
System Table updates
There are various updates to system tables, Databricks' store of operational data for observability:
- Databricks Assistant system tables in public preview: Track the usage of Databricks Assistant through the system.access.assistant_events table, which records the workspace, the event time, and the email of the user who initiated a message to the Assistant.
- Node timeline system tables in public preview: The node timeline table provides node-level utilisation at one-minute granularity. Monitor metrics such as node type, CPU and memory utilisation, and network traffic sent in bytes.
- Query history system tables in public preview: The system.query.history table records every SQL statement run via SQL warehouses, with metrics such as the SQL statement text, warehouse ID, execution duration, and bytes read (see the sketch query after this list).
- Billing system tables are enabled by default in all Unity Catalog workspaces. Billing tables give you an overview of usage by SKU, duration, and so on.
- Workflows system tables in public preview: There are four tables in the system.workflow schema, which allow you to monitor:
- jobs: tracks creation, deletion & basic information of all jobs
- job_tasks: tracks creation, deletion & basic information of all job tasks
- jobs_run_timeline: records the start, end and resulting state of job runs
- job_task_run_timeline: records the start, end, and resulting state of job tasks
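As a sketch of how these tables can be queried, the example below pulls the slowest recent statements from the query history table. It assumes the column names described in the public preview docs (statement_text, total_duration_ms, read_bytes, start_time); verify them with DESCRIBE TABLE in your workspace:

```sql
-- Longest-running SQL warehouse statements over the last 7 days
SELECT
  executed_by,
  statement_text,
  total_duration_ms / 1000 AS duration_seconds,
  read_bytes
FROM system.query.history
WHERE start_time >= date_sub(current_date(), 7)
ORDER BY total_duration_ms DESC
LIMIT 20;
```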
Primary Key and Foreign Key constraints are GA and now enable faster queries
Primary key (PK) and foreign key (FK) constraints can be defined on Unity Catalog tables for data modeling purposes. You can define them during table creation or add them afterwards with an ALTER TABLE statement. Do note that primary and foreign key constraints are currently informational and not enforced. They are mainly used to document data integrity relationships, and they also give end users the ability to view the constraints in Unity Catalog via an entity relationship diagram (ERD).
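A minimal sketch of the DDL, using hypothetical tables in a hypothetical main.sales schema:

```sql
-- Primary key declared inline at table creation
CREATE TABLE main.sales.customers (
  customer_id BIGINT NOT NULL CONSTRAINT customers_pk PRIMARY KEY,
  name        STRING
);

-- Foreign key declared as a table-level constraint
CREATE TABLE main.sales.orders (
  order_id    BIGINT NOT NULL,
  customer_id BIGINT,
  CONSTRAINT orders_customers_fk FOREIGN KEY (customer_id)
    REFERENCES main.sales.customers (customer_id)
);

-- Constraints can also be added to an existing table
ALTER TABLE main.sales.orders
  ADD CONSTRAINT orders_pk PRIMARY KEY (order_id);
```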
For valid primary keys, adding the RELY option lets Databricks use the constraint for optimisation: the query optimizer factors the declared primary key's data integrity into query plans.
One optimization RELY enables is eliminating unnecessary aggregates based on the primary key constraint. For example, if a DISTINCT operation is run over a column that is a primary key declared with RELY, the redundant DISTINCT is removed, which speeds up the query by 2x.
Another optimization from RELY is removing unnecessary joins. If a query joins to a table that is only referenced in the join condition, the primary key constraint indicates that the join produces at most one matching row, which helps the query optimizer identify cases where it can eliminate the join from the query entirely. In the blog example, this optimization sped the query up from 1.5 minutes to 6 seconds!
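As a sketch of the kind of statements that benefit, continuing the hypothetical main.sales tables above and assuming the customers primary key was declared with the RELY option:

```sql
-- Assumes the primary key was declared with RELY, e.g. at CREATE TABLE time:
--   customer_id BIGINT NOT NULL CONSTRAINT customers_pk PRIMARY KEY RELY

-- Aggregate elimination: customer_id is a primary key declared with RELY,
-- so the DISTINCT cannot remove any rows and the optimizer can drop it
SELECT DISTINCT customer_id
FROM main.sales.customers;

-- Join elimination: customers is only referenced in the join condition and
-- the RELY primary key guarantees at most one match per order, so the
-- optimizer can drop the join entirely
SELECT o.order_id, o.customer_id
FROM main.sales.orders AS o
LEFT JOIN main.sales.customers AS c
  ON o.customer_id = c.customer_id;
```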
Read the full blog post here!
