Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
EduardoLomonaco
Databricks Employee

Overview

In Databricks Unity Catalog, tags are the primary mechanism for scaling data discovery and attribute-based access control (ABAC). However, without a clear strategy, tag usage often becomes inconsistent, leading to "metadata swamps" where policies are difficult to enforce, and data is challenging to find.

This guide outlines a standardized approach to tagging, covering the architecture of Governed vs. non-governed tags, essential naming conventions, and the SQL patterns you need to audit your environment.

Architecture of Governed and Non-Governed Tags

To strike a balance between strict compliance and developer flexibility, successful governance teams divide their tagging strategy into two distinct tiers. This prevents a "one-size-fits-all" approach from slowing down innovation while ensuring that high-stakes security and financial attributes remain tightly controlled.

Instead of allowing teams to invent their own taxonomy from scratch, organizations should deploy a standard set of "Core Tags" that every data product must possess. By explicitly defining which tags are Governed or Non-Governed, you create a clear boundary between account-wide policy enforcement and team-specific context. A core capability is the ability to convert Non-Governed tags into Governed tags after assignment simply by defining them at the account level.

To maintain a high-quality metadata environment, align your team around these two tiers:

  • The Governance Core (Governed tags): Use these for cross-cutting, high-stakes attributes that drive Attribute-Based Access Control (ABAC), compliance, and cost tracking. Governed tags allow you to define a static list of allowed values (e.g., environment must be dev, test, or prod). Because you control who can apply or modify these tags, they serve as a reliable guardrail against unauthorized changes to compliance policies.
  • The Context Layer (Non-Governed tags): These provide the flexibility needed for ad hoc, team-specific, or project-specific markers. While these tags do not have restricted value lists, they must still adhere to your "Tagging Conventions" (e.g., lowercase snake_case) to remain searchable. This layer enables teams to move quickly without waiting for central approval on every new metadata requirement.

 

Tier: Governed
Purpose: Policy & Compliance
Examples (Key: Allowed Values):
  • data_classification: (public, internal, confidential, restricted)
  • environment: (dev, test, prod)
  • retention_policy: (short_term, standard, legal_hold)
  • lifecycle_status: (draft, active, certified, deprecated)
  • owner_lob: (finance, sales, marketing, it)

Tier: Non-Governed
Purpose: Context & Agility
Examples (Key; values are free-form):
  • project_code
  • initiative
  • backlog_item
  • line_of_business
  • on_call_owner
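
Before promoting an existing non-governed key to a Governed Tag, it can help to see which values are already in use. The query below is a minimal sketch, assuming the environment key and its allowed values from the examples above and the system.information_schema.table_tags view; the allowed list is illustrative and only table-level assignments are checked.

```sql
-- Sketch: list table-level assignments whose value falls outside the
-- intended allowed list for a key. The allowed list is illustrative.
WITH allowed(tag_name, tag_value) AS (
  VALUES ('environment', 'dev'),
         ('environment', 'test'),
         ('environment', 'prod')
)
SELECT g.catalog_name, g.schema_name, g.table_name, g.tag_name, g.tag_value
FROM system.information_schema.table_tags g
LEFT ANTI JOIN allowed a
  ON a.tag_name  = g.tag_name
 AND a.tag_value = g.tag_value
-- Only inspect keys that appear in the allowed list
WHERE g.tag_name IN (SELECT DISTINCT tag_name FROM allowed);
```

Any rows returned are assignments that would need to be corrected, or whose values would need to be added to the allowed list, when the key becomes governed.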

Naming Standards & Constraints

Adopting strict tagging conventions keeps your metadata portable, queryable, and usable for automation. Because ABAC policies and discovery both match on exact tag keys and values, inconsistency is a structural issue: a single variant spelling can silently exclude an object from a security policy.

Tags should be short, descriptive, and avoid ambiguous abbreviations that may confuse users or overlap with system-generated metadata.

To maintain a high-quality metadata environment, align your team around these core principles:

  • Enforce one convention (e.g., snake_case or camelCase): Pick a single letter casing and separator convention and apply it everywhere (e.g., all lowercase with underscores: retention_policy, cost_center). Kebab-case is not an option for keys because hyphens are prohibited characters (see the platform constraints below). A single convention prevents duplicate variations and ensures that a policy looking for data_owner doesn't miss an object tagged as Data_Owner.
  • Avoid System Conflicts: Refrain from using potentially reserved or generic terms like name, id, system, or certified unless they are explicitly aligned with Databricks system tags.
  • Prioritize Low Cardinality: Use tags for categories and groups rather than unique identifiers or timestamps; tags are for classification, not for tracking individual execution instances. Encoding unique IDs, such as RunID or TransactionID, into tags creates high-cardinality "noise" that clutters the billing system and compromises the ability to perform meaningful metadata aggregations.
  • Protect Sensitive Data: Keep PII and confidential business logic out of tag keys and values, as tags are not encrypted and are broadly accessible.
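
To check whether casing variants of the same key already exist, a query along these lines could be used. It is a sketch that assumes the system.information_schema.table_tags view and covers table-level tags only; the same pattern would apply to the column, schema, and catalog tag views.

```sql
-- Sketch: find keys that differ only by letter case (e.g., Owner vs owner)
SELECT LOWER(tag_name)       AS normalized_key,
       COLLECT_SET(tag_name) AS variants,     -- the distinct spellings in use
       COUNT(*)              AS assignments   -- total assignments across variants
FROM system.information_schema.table_tags
GROUP BY LOWER(tag_name)
HAVING COUNT(DISTINCT tag_name) > 1
ORDER BY assignments DESC;
```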

Platform Limitations & Constraints

When programmatically applying tags, you must adhere to the following hard constraints enforced by Unity Catalog (these limits are subject to change and should be re-verified against the current Databricks documentation):

  • Governed Tags: An account can have up to 1,000 Governed Tags, and each may have up to 50 allowed values.
  • Character Limits: Tag keys are capped at 255 characters, while tag values can reach a maximum of 1,000 characters.
  • Prohibited Characters: You cannot use special characters such as . , - = / : in tag keys, nor are leading or trailing spaces allowed in keys or values.
  • Tag keys are case sensitive (for example, Owner and owner are distinct keys).
  • Object Capacity: Each securable object (table or column) is limited to a maximum of 50 tags; additionally, a single table cannot exceed 1,000 total column tags across its entire schema.
  • Search Behaviour: While SQL allows broad querying, the Tag search in the Workspace Search UI supports all Unity Catalog objects except functions and requires exact term matching.

The Organizational Blueprint: Implementing Tags at Scale

Organizational adoption is arguably the most challenging part of implementing tags, as they cannot be successfully deployed in isolation. To be effective, tags must be defined centrally to ensure consistency, yet leveraged in a distributed way by the teams building the data. Without this balance, your tagging efforts will remain a technical exercise rather than a functional governance tool.

Success requires a unified "Tagging Strategy" that stakeholders across data engineering, security, and business units agree upon. This blueprint ensures that as your Databricks environment grows, your metadata remains a reliable map rather than a collection of fragmented labels.

Before implementing your tags, use this alignment checklist to drive the decision-making process:

  • Define Scope and Priority: Decide which securable objects you will tag first (Catalogs, Schemas, Tables, or Volumes). Prioritize the layers where Attribute-Based Access Control (ABAC), discovery, and lineage deliver the most immediate value. Evaluate your mandate and sponsorship to determine the level of comprehensiveness required for your tag implementation.
  • Establish Control and Permissions: Determine who is authorized to create, modify, and assign Governed Tag definitions. Clearly define assignment permissions to ensure that only designated stewards/owners can apply high-stakes tags like data_classification.
  • Leverage ABAC Inheritance: When setting up policy boundaries, prefer applying coarse-grained tags at the Catalog or Schema level. These are inherited for ABAC policy evaluation for all nested objects, significantly simplifying your security model. 
  • Choose the Source of Truth: For production environments, manual UI tagging does not scale. Manage your Governed Tag definitions and assignments via Infrastructure as Code (IaC), such as Terraform, to ensure versioning, peer reviews, and state consistency.
  • Standardize the Request Workflow: Create a clear, simple process for how teams request new tags or values. Document your reserved lists and naming conventions in a "one-page policy" that is easily accessible to all developers.
  • Operational Maintenance: Tagging is not a "set and forget" task. Establish a fixed cadence for audits and cleanups to deprecate obsolete tags and ensure teams aren't bypassing the established taxonomy.
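
As a small illustration of the coarse-grained approach, tags can be applied at the catalog and schema level and then reviewed through the corresponding information schema views. This is a sketch: the main catalog and finance schema are illustrative names, and the SET TAG form assumes a runtime that supports it (see the syntax section later in this guide).

```sql
-- Coarse-grained tags at the top of the hierarchy (object names are illustrative)
SET TAG ON CATALOG main environment = prod;
SET TAG ON SCHEMA main.finance data_classification = confidential;

-- Review what is tagged at each level
SELECT 'catalog' AS level, catalog_name AS object_name, tag_name, tag_value
FROM system.information_schema.catalog_tags
UNION ALL
SELECT 'schema', CONCAT(catalog_name, '.', schema_name), tag_name, tag_value
FROM system.information_schema.schema_tags
ORDER BY level, object_name, tag_name;
```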

Implementation at Scale: Infrastructure as Code (IaC)

While it is possible to create and apply tags manually through the Unity Catalog UI, this approach is insufficient for enterprise environments. Manual tagging is prone to human error, lacks a versioned history, and becomes an operational bottleneck as the number of catalogs and schemas grows. To achieve true scalability and consistency, you should treat your "Tagging Strategy" as code.

The most effective method for managing Governed Tags is through an Infrastructure as Code (IaC) framework, specifically Terraform. By using the Databricks Terraform provider, you can define your tag taxonomy in configuration files, allowing for peer reviews via Pull Requests and ensuring that the state of your metadata remains consistent across development, testing, and production environments.

To scale your tagging implementation effectively, align your technical workflow with these core principles:

  • Standardize with Tag Policies: Use the databricks_tag_policy resource to define the structure of your Governed Tags at the account level. This resource allows you to bake your naming conventions and allowed values directly into the platform, preventing users from creating non-compliant tags.
  • Decouple Assignment from Definition: Use databricks_entity_tag_assignment for Unity Catalog securable objects (catalogs, schemas, tables) and databricks_workspace_entity_tag_assignment for workspace-specific assets. This modular approach allows you to update a tag’s value across hundreds of objects by changing a single line of code.
  • Version Your Taxonomy: Storing your tag definitions in a Git repository provides a full audit trail of who changed a tag value and why. This is essential for compliance-heavy industries where changes to data_classification or retention_policy must be documented.
  • Use the Databricks SDK for Custom Automation: While Terraform is the gold standard for stateful resources, the Databricks SDK (Python/Go/Java) and REST API are powerful alternatives for dynamic, event-driven tagging—such as applying "Non-Governed tags" automatically upon the completion of a DLT (Delta Live Tables) pipeline.

Platform Limitations & Constraints

When automating your tagging via IaC, keep these operational boundaries in mind:

  • State Management: When using Terraform, avoid "mixing and matching" UI-based tagging with code-based tagging. Manual changes in the UI will cause "state drift," which Terraform will attempt to overwrite during the next deployment.
  • Provider Permissions: Ensure the service principal used by your IaC pipeline has the CREATE permission at the account level (needed to define Governed Tags) and sufficient privileges on the target objects to apply assignments: typically APPLY TAG on the object, ASSIGN on the Governed Tag being applied, and USE SCHEMA and USE CATALOG on the object's parents.
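
The object-level privileges can be granted in SQL. The statements below are a sketch: `tagging-automation-sp` is a hypothetical principal and main.finance.orders an illustrative table. The account-level CREATE permission and any permission on the Governed Tag itself are managed separately and are not shown.

```sql
-- `tagging-automation-sp` is a hypothetical principal; substitute the
-- identifier of your own service principal.
GRANT USE CATALOG ON CATALOG main TO `tagging-automation-sp`;
GRANT USE SCHEMA  ON SCHEMA main.finance TO `tagging-automation-sp`;
GRANT APPLY TAG   ON TABLE main.finance.orders TO `tagging-automation-sp`;

-- Confirm what the principal holds on the table
SHOW GRANTS `tagging-automation-sp` ON TABLE main.finance.orders;
```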

The "Boolean Pattern" for Multi-Value Tags

An architectural constraint in Unity Catalog is that a securable object can only be assigned one value for any specific Governed Tag key at a time. For example, if you create a tag called accessible_teams with allowed values like engineering, sales, and marketing, you cannot assign both engineering and sales to a single table. Attempting to circumvent this by using comma-separated strings (e.g., engineering, sales) breaks the enforcement of Governed Tag "allowed values" and makes ABAC policy logic unnecessarily complex.

To support multiple attributes for the same category, you should adopt the Boolean Pattern. Instead of a single multi-value key, structure your tags as distinct boolean flags. This allows you to apply multiple attributes to a single object while keeping your security logic simple, predictable, and strictly enforced.

To effectively bypass single-value limitations while maintaining governance, align your implementation with these patterns:

  • Deconstruct Overlapping Categories: If a dataset falls under multiple regulatory or business domains, model them as separate keys (for example, contains_pii [personally identifiable information], contains_pci [payment card information], contains_phi [protected health information]).
  • Use Boolean Flags (or Key Only Tags): Use true or false as the tag values so that each key acts as a boolean flag. Your Attribute-Based Access Control (ABAC) policies can then rely on simple equality predicates, such as WHERE tag_value = 'true', which are simpler and less error-prone than matching substrings or parsing delimited lists. Another possibility is to use Key Only Tags with the hasTag condition.
  • Simplify Access Logic: By using distinct keys, you can create granular policies. For example, you can grant a user access to PII data without accidentally granting access to PCI data on the same resource—a task that is notoriously difficult with multi-value strings.
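
The pattern could look like the following in SQL. This is a sketch using the illustrative table main.finance.orders; the values are wrapped in backticks because true can be read as a keyword, and the last statement shows a key-only tag with no value.

```sql
-- One flag per attribute instead of a single multi-value key
SET TAG ON TABLE main.finance.orders contains_pii = `true`;
SET TAG ON TABLE main.finance.orders contains_pci = `true`;

-- Key-only variant: the presence of the key is the signal
SET TAG ON TABLE main.finance.orders contains_phi;

-- Find tables flagged as holding both PII and PCI data
SELECT catalog_name, schema_name, table_name
FROM system.information_schema.table_tags
WHERE tag_name IN ('contains_pii', 'contains_pci')
  AND tag_value = 'true'
GROUP BY catalog_name, schema_name, table_name
HAVING COUNT(DISTINCT tag_name) = 2;
```

Because each attribute is its own key, the final query stays a plain equality filter with no string parsing.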

The Governance Toolkit: Essential SQL Audits

A tagging strategy is only as effective as your ability to audit it. In Unity Catalog, the system.information_schema provides a live, centralized view of every tag applied across your catalogs, schemas, tables, and columns. By leveraging these system tables, you can move from reactive troubleshooting to proactive compliance monitoring.

These SQL patterns enable you to identify "metadata rot," detect naming convention violations, and ensure that your high-stakes Governed Tags are applied where they are most needed.

To maintain the integrity of your metadata ecosystem, incorporate these audit patterns into your regular governance cadence:

  • Monitor for Cardinality Creep: Use distinct value counts to find tags being misused as unique identifiers (e.g., run_id). Any tag exceeding 100 distinct values is a candidate for deprecation, as high cardinality compromises billing aggregation and metadata clarity.
  • Identify Mandatory Tag Gaps: Don't wait for a security audit to discover untagged data. Programmatically find tables that lack critical "Core Tags" like data_classification or data_owner to ensure ABAC policies remain effective.
  • Normalize Case and Character Violations: Use regex-based queries to identify "shadow" tags (e.g., Owner vs owner) or tags using prohibited characters like dots or dashes. This ensures your environment remains strictly snake_case.
  • Track Resource Limits: Proactively monitor objects approaching the hard limit of 50 tags per object or 1,000 column tags per table to prevent downstream Infrastructure as Code (IaC) deployment failures.

Audit Cookbook

  • Detect High-Cardinality Abuse: Identify tags acting as unique IDs rather than grouping categories.
SELECT tag_name, COUNT(DISTINCT tag_value) AS distinct_values
FROM system.information_schema.table_tags
GROUP BY tag_name
HAVING COUNT(DISTINCT tag_value) > 100
ORDER BY distinct_values DESC;
  • Identify Casing and Character Violations: Find tags that violate the snake_case standard or contain prohibited characters (e.g., . , - = / :).
-- Detect non-snake_case keys
SELECT DISTINCT tag_name
FROM system.information_schema.table_tags
WHERE NOT (tag_name RLIKE '^[a-z0-9_]+$');

-- Detect prohibited special characters
SELECT DISTINCT tag_name
FROM system.information_schema.table_tags
WHERE tag_name RLIKE '[\\.,\\-=/:]';
  • Find Objects Missing Mandatory Tags: Identify tables that fail compliance because they lack a required Governed Tag.
-- Tags every table is expected to carry
WITH required(tag_name) AS (
  SELECT 'data_owner' UNION ALL SELECT 'data_classification'
),
all_tables AS (
  SELECT table_catalog, table_schema, table_name
  FROM system.information_schema.tables
  -- Unity Catalog reports table types such as MANAGED and EXTERNAL;
  -- verify the values present in your metastore before relying on this filter
  WHERE table_type IN ('MANAGED','EXTERNAL','VIEW','MATERIALIZED_VIEW','STREAMING_TABLE')
    AND table_schema <> 'information_schema'
)
-- One output row per (table, missing tag) pair
SELECT t.table_catalog, t.table_schema, t.table_name, r.tag_name AS missing_tag
FROM all_tables t
CROSS JOIN required r
LEFT ANTI JOIN system.information_schema.table_tags g
  ON g.catalog_name = t.table_catalog
 AND g.schema_name  = t.table_schema
 AND g.table_name   = t.table_name
 AND g.tag_name     = r.tag_name;
  • Generate Cleanup Commands: Quickly generate the SQL needed to remove non-standard tag variants. Review the generated statements before running them.
-- Backticks keep object names and tag keys with unusual characters valid
SELECT CONCAT(
  'UNSET TAG ON TABLE `', catalog_name, '`.`', schema_name, '`.`', table_name, '` `', tag_name, '`;'
) AS cleanup_cmd
FROM system.information_schema.table_tags
WHERE tag_name IN ('Owner', 'OWNER', 'Env');
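
The resource limits described earlier (50 tags per object, 1,000 column tags per table) can be monitored with the same views. The queries below are a sketch; the thresholds of 40 and 800 are arbitrary early-warning levels chosen for illustration, not platform values.

```sql
-- Tables approaching the per-object tag limit (threshold is illustrative)
SELECT catalog_name, schema_name, table_name, COUNT(*) AS tag_count
FROM system.information_schema.table_tags
GROUP BY catalog_name, schema_name, table_name
HAVING COUNT(*) >= 40
ORDER BY tag_count DESC;

-- Tables approaching the column-tag limit (threshold is illustrative)
SELECT catalog_name, schema_name, table_name, COUNT(*) AS column_tag_count
FROM system.information_schema.column_tags
GROUP BY catalog_name, schema_name, table_name
HAVING COUNT(*) >= 800
ORDER BY column_tag_count DESC;
```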

SQL Manual Implementation Syntax

The SQL syntax for applying and removing tags depends on your Databricks Runtime (DBR) version. While the legacy syntax remains common, we recommend adopting the Standard Syntax for its improved readability and consistency with other UC securable object commands.

  • Adopt SET/UNSET for Modern Runtimes: For DBR 16.1+, use the simplified SET TAG and UNSET TAG syntax. It is more intuitive and reduces the risk of syntax errors in automation scripts.
  • Use Backticks for Reserved Keywords: If your tag key happens to overlap with a SQL reserved word, always wrap the key and value in backticks (e.g., `environment` = `prod`) to ensure successful execution.
  • Prioritize Column-Level Tagging for Sensitivity: Use column-level tags specifically for sensitivity markers (like contains_pii). This allows you to build fine-grained Dynamic View policies that mask data based on these specific attributes.

Standard syntax (Databricks Runtime 16.1+). On Databricks Runtime 13.3 through 16.0, use the legacy ALTER ... SET TAGS and UNSET TAGS forms described in the documentation.

-- Apply a tag to a table
SET TAG ON TABLE main.finance.orders cost_center = finance_dept;
-- Remove a tag from a specific column
UNSET TAG ON COLUMN main.finance.orders.ssn contains_pii;
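
For runtimes that predate SET TAG, the same operations can be expressed with the legacy ALTER form. This is a sketch that reuses the illustrative main.finance.orders table from the example above; note that keys and values are string literals here rather than identifiers.

```sql
-- Legacy syntax: apply a tag to a table
ALTER TABLE main.finance.orders SET TAGS ('cost_center' = 'finance_dept');

-- Legacy syntax: apply and then remove a tag on a specific column
ALTER TABLE main.finance.orders ALTER COLUMN ssn SET TAGS ('contains_pii' = 'true');
ALTER TABLE main.finance.orders ALTER COLUMN ssn UNSET TAGS ('contains_pii');

-- Legacy syntax: remove a tag from a table
ALTER TABLE main.finance.orders UNSET TAGS ('cost_center');
```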

By combining these SQL patterns with a robust Infrastructure as Code (IaC) workflow, your organization can transform Unity Catalog tags from simple labels into a powerful, automated engine for data governance and security.
