Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

System table missing primary keys?

aranjan99
Contributor

This simple query takes 50 seconds for me on an X-Small warehouse.

select * from SYSTEM.access.workspaces_latest where workspace_id = '442224551661121'

Can the team comment on why querying system tables takes so long? I also don't see any primary keys set on this table; is that intentional?

6 REPLIES

pradeep_singh
Contributor

Is this the case with initial reads only, or are subsequent reads slow as well? The table lives in a Databricks-hosted account and is shared to your metastore, so a first query often pays the cost of establishing the share connection.

https://learn.microsoft.com/en-us/azure/databricks/admin/system-tables/#where-is-system-table-data-s...

Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

iyashk-DB
Databricks Employee

System tables are a Databricks-hosted, read-only analytical store shared to your workspace via Delta Sharing; they aren't modifiable (no indexes you can add), and the first read can have extra overhead on a very small warehouse. This can make "simple" selects feel slow on X-Small, especially if the warehouse was cold or the table had to be hydrated from the sharing layer.

If possible, run interactive reads on a larger or Serverless warehouse; X-Small is cost-efficient but can have higher per-query latency, particularly on first access to shared datasets. Also avoid SELECT *; project only the columns you need to minimise bytes scanned.

I ran the query on a serverless warehouse, and it was not a cold start because the warehouse was already up and running other queries. Even on a Medium warehouse the query still takes about 40 seconds. On the note of the first query: if I run the exact same query again it is faster, possibly because it gets cached, but if I do a lookup on another workspace ID it again takes 40 seconds. The table only has 2 rows for us, so it is very surprising to see these simple queries take 40-50 seconds. I also tried selecting only two fields instead of SELECT *, but that doesn't change the time it takes.



aranjan99
Contributor

Any other suggestions on how to optimize this?

aranjan99
Contributor

When I run DESCRIBE EXTENDED on the table, it shows Statistics as 308105571 bytes, 144413 rows, which is the real reason this query is so slow. Even though count(*) returns 1 on this table, the query is scanning almost 0.3 GB of data because it is a Delta share. Can you share how often VACUUM is run on these tables?
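For anyone who wants to reproduce this check, the comparison described above boils down to two queries (assuming you have read access to the system schema):

```sql
-- Compare the table statistics the optimizer sees with the actual row count.
-- A large "Statistics" value alongside a tiny row count suggests many
-- unreclaimed files behind the Delta share.
DESCRIBE EXTENDED system.access.workspaces_latest;

SELECT count(*) AS row_count
FROM system.access.workspaces_latest;
```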

SteveOstrowski
Databricks Employee

Hi @aranjan99,

There are two separate topics here, so let me address each one.

WHY THE QUERY IS SLOW (~50 SECONDS ON X-SMALL)

System tables are served via Delta Sharing from a Databricks-hosted storage account in the same region as your Unity Catalog metastore. Because of this architecture, the first query against a system table in a session typically incurs a "cold start" overhead: the warehouse needs to establish the Delta Sharing connection, list the remote files, and pull the data across. On an X-Small SQL warehouse, that startup cost is more pronounced because of the smaller compute footprint.

A few things that can help:

1. Use a Serverless SQL Warehouse if available. Serverless warehouses have optimized networking paths and start faster, which reduces overhead on system table queries.

2. Warehouse size matters for first-query latency. An X-Small warehouse has minimal resources, so the initial file listing and data fetch take longer. Once the data is cached in the warehouse, subsequent queries against the same table in the same session will be significantly faster.

3. Filter early and select only needed columns. Instead of SELECT *, try selecting just the columns you need:

SELECT workspace_id, workspace_name, workspace_url, status
FROM system.access.workspaces_latest
WHERE workspace_id = '442224551661121'

4. Consider caching results locally. If you query system tables frequently for operational dashboards, consider writing the results to a local Delta table on a schedule (for example, a nightly job). That way your dashboards read from a local table with full query performance, and only the scheduled job pays the Delta Sharing overhead.

5. Check for concurrent load. If the warehouse is also serving other queries, an X-Small may be resource-constrained. Ensure the warehouse is not under heavy concurrent load during your test.

The workspaces_latest table is typically small (one row per active workspace in your account), so 50 seconds is almost certainly dominated by connection setup and cold start rather than data volume.
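Point 4 above can be sketched as a scheduled snapshot job; the catalog and schema names below are placeholders, and the column list follows the earlier example:

```sql
-- Nightly snapshot: materialise the shared system table into a local Delta
-- table so dashboards avoid the Delta Sharing overhead on every read.
-- my_catalog and my_schema are placeholder names.
CREATE OR REPLACE TABLE my_catalog.my_schema.workspaces_latest_snapshot AS
SELECT workspace_id, workspace_name, workspace_url, status
FROM system.access.workspaces_latest;
```

Run this from a scheduled job (for example, nightly), so only the job pays the sharing-layer cost while readers hit the local copy.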

WHY THERE ARE NO PRIMARY KEYS

System tables are read-only Delta tables hosted by Databricks and shared via Delta Sharing. They do not have primary key or foreign key constraints defined, and this is by design for a few reasons:

- Primary keys in Databricks are informational, not enforced. Even on user-created Delta tables, primary key constraints serve as metadata hints for query optimizers and BI tools. They do not enforce uniqueness at write time. Since system tables are read-only and managed entirely by Databricks, adding informational primary keys would not change query performance or data integrity.

- Delta Sharing tables have limited DDL support. Because system tables are shared from a provider (Databricks) to your account as a recipient, constraint metadata is not propagated through the Delta Sharing protocol.

- The data model uses natural keys. For system.access.workspaces_latest, the workspace_id column is effectively the natural unique key (one row per workspace). You can rely on this for joins and lookups even without a formal constraint declaration.
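As an illustration of relying on the natural key, a join against another system table might look like the following; system.billing.usage is used here as an assumed example of a table that carries a workspace_id column:

```sql
-- Illustrative join on the natural key workspace_id (no declared constraint
-- is needed for this to work); assumes system.billing.usage is enabled in
-- your account and exposes workspace_id and usage_quantity.
SELECT w.workspace_name, SUM(u.usage_quantity) AS total_usage
FROM system.billing.usage AS u
JOIN system.access.workspaces_latest AS w
  ON u.workspace_id = w.workspace_id
GROUP BY w.workspace_name;
```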

If you need primary key metadata for documentation or BI tool integration, you can create a view on top of the system table in your own catalog and define constraints on a materialized copy. For example:

CREATE TABLE my_catalog.my_schema.workspaces_snapshot AS
SELECT * FROM system.access.workspaces_latest;

ALTER TABLE my_catalog.my_schema.workspaces_snapshot
ADD CONSTRAINT pk_workspace PRIMARY KEY (workspace_id);
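The view option mentioned above could look like the sketch below (names are placeholders). Note the trade-off: the view always reflects the live shared data but cannot carry constraints, while the materialized snapshot can declare an informational primary key but needs refreshing:

```sql
-- Lightweight alternative: a view in your own catalog over the shared table.
-- No refresh job is needed, but no constraints can be declared on it.
CREATE OR REPLACE VIEW my_catalog.my_schema.workspaces_v AS
SELECT * FROM system.access.workspaces_latest;
```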

REFERENCE DOCUMENTATION

- System tables overview: https://docs.databricks.com/aws/en/admin/system-tables/
- Workspaces system table: https://docs.databricks.com/aws/en/admin/system-tables/workspaces.html
- Table constraints (primary keys, foreign keys): https://docs.databricks.com/aws/en/tables/constraints.html

* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the response if I detect any drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.