Databricks Community

Sunil_Poluri · ‎07-07-2025

I've set up Unity Catalog with an external location pointing to a storage account. For each schema, I’ve configured a dedicated container path. For example:

abfss://schemas@<storage_account>.dfs.core.windows.net/_unityStorage/schemas/<schema_id>

When I create a schema, a schema_id is generated. I expect this schema_id to be reflected as a folder under the schema container path, like:

/_unityStorage/schemas/<schema_id>

However, I’ve noticed that this folder doesn’t appear immediately—presumably because no objects (like tables) exist yet.

Here’s what I’ve observed:

When I create a Delta table within the schema, I expect the table data to be stored under the schema’s storage path.
Similarly, when I create a DLT pipeline targeting the same schema, I expect the tables to be stored under the same schema path.
But instead, a new schema ID folder gets created in the storage account under the schema container—even though the schema name is the same.

My question is: Under what conditions does Unity Catalog generate a new schema_id folder in the storage account, even when the schema name hasn’t changed?

Any insights or documentation references would be greatly appreciated!

Louis_Frolio · ‎09-24-2025

Hey @Sunil_Poluri, I did some research, learned a few things along the way, and here is what I found.

Unity Catalog manages cloud storage mapping for schemas using internal IDs (schema_id) to ensure data isolation, governance, and uniqueness within a metastore—even if schema names are the same across catalogs or across time. Here is a summary of the key factors that influence when new schema_id folders are created under an external location, even if the schema name hasn’t changed:

1. Schema Drop and Re-create

Behavior: Unity Catalog assigns a unique internal identifier (schema_id) to every schema when it is created.
If a schema is dropped and re-created—even if the name is identical—a new schema_id (and thus a new folder) is generated. Old object data persists in the previous folder, but new objects (managed tables) will write to the new schema_id directory.
Implication: This is the most common reason for seeing multiple schema_id folders for a schema name.
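To make the drop-and-re-create effect concrete, here is a small toy model in Python. It is not the Unity Catalog API: ToyMetastore, the schema name "sales", and the use of uuid4 for IDs are all assumptions made for illustration. Only the folder layout is taken from the question above.

```python
import uuid
from typing import Dict


class ToyMetastore:
    """In-memory stand-in for a metastore (hypothetical, not a Databricks API)."""

    def __init__(self, storage_root: str) -> None:
        self.storage_root = storage_root.rstrip("/")
        self._schemas: Dict[str, str] = {}  # schema name -> schema_id

    def create_schema(self, name: str) -> str:
        if name in self._schemas:
            raise ValueError(f"schema {name!r} already exists")
        schema_id = str(uuid.uuid4())  # a fresh ID on every creation
        self._schemas[name] = schema_id
        return schema_id

    def drop_schema(self, name: str) -> None:
        del self._schemas[name]

    def managed_path(self, name: str) -> str:
        # Folder layout copied from the question above.
        return f"{self.storage_root}/_unityStorage/schemas/{self._schemas[name]}"


store = ToyMetastore("abfss://schemas@<storage_account>.dfs.core.windows.net")

store.create_schema("sales")
first_path = store.managed_path("sales")

store.drop_schema("sales")
store.create_schema("sales")  # same name, new schema_id
second_path = store.managed_path("sales")

print(first_path != second_path)  # same schema name, different folder
```

The point of the sketch is only that the folder is keyed by the ID, not by the name, so a re-created schema never reuses the old folder.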

2. Publishing Tables via DLT or Pipelines

When you use Databricks Delta Live Tables (DLT) pipelines, managed tables are always written under the folder for the target schema's current schema_id. If a pipeline (or notebook) triggers creation of a schema that doesn't yet exist (for example, by referencing it as a target), Unity Catalog creates a new schema and assigns a new schema_id.
If the schema was deleted and re-created without your knowledge (for example, by automation that runs at unpredictable times), the schema_id changes even though the schema name stays the same.

3. Direct Versus Indirect Schema Creation Channels

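Databricks workflows, DLT, Databricks Asset Bundles, and manual UI actions all use the same underlying APIs, but automation (for example, CI/CD-driven schema creation in Asset Bundles or infrastructure-as-code) can lead to unintentional dropping and re-creating of schemas under the hood, causing new IDs to be assigned.
Note that a plain CREATE SCHEMA does not replace an existing schema; it fails if the schema already exists. The risk lies in scripts that drop before they create (for example, DROP SCHEMA ... CASCADE followed by CREATE SCHEMA), or in tooling that destroys and re-creates a schema when one of its properties changes. Using “IF NOT EXISTS” without a preceding drop keeps the existing schema and its schema_id.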
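As a rough sketch of the “IF NOT EXISTS” pattern, the Python helper below builds an idempotent CREATE SCHEMA statement that a deployment script could submit instead of a drop-then-create pair. The helper names and the catalog/schema names ("main", "sales") are hypothetical; the optional MANAGED LOCATION clause assumes you want to pin the schema to a specific storage path.

```python
from typing import Optional


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def create_schema_sql(
    catalog: str, schema: str, managed_location: Optional[str] = None
) -> str:
    """Build a CREATE SCHEMA statement that is safe to re-run.

    No DROP is emitted, so an existing schema (and its schema_id) is kept.
    """
    stmt = f"CREATE SCHEMA IF NOT EXISTS {quote_ident(catalog)}.{quote_ident(schema)}"
    if managed_location is not None:
        if "'" in managed_location:
            raise ValueError("managed_location must not contain single quotes")
        stmt += f" MANAGED LOCATION '{managed_location}'"
    return stmt


print(create_schema_sql("main", "sales"))
```

Running the generated statement a second time should be a no-op, which is what keeps the schema_id folder stable across deployments.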

4. Backing Storage or Location Changes

Changing the storage root location property on the schema or re-registering it can also be a scenario where a new schema_id is minted. However, most documentation and troubleshooting guidance emphasize schema drops and re-creations (planned or accidental) as primary drivers.

5. Multiple Metastores or Region/Workspace Boundaries

If running with multiple metastores or cross-region/catalog patterns, schemas with the same name in different metastores are always mapped to distinct internal IDs and thus distinct folders.

6. No Object, No Folder Until First Table

As noted, the schema_id folder is not created in the underlying storage until a managed object (such as a Delta table) is created within the schema. This lazy provisioning is expected behavior for storage efficiency.

Important Additional Notes

The internal IDs are not exposed in user-facing controls; only the folder names in storage and some low-level APIs reveal them. Schema_id changes are not triggered by table creation alone unless the schema itself is new (i.e., it did not exist at the time of table creation).
If you see unexpected new schema_id folders, check audit logs, schema version histories, and CI/CD system activity for drop/create events.
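When auditing, it can help to separate the folder that belongs to the current schema from leftovers of earlier incarnations. The Python sketch below does that on plain path strings. It assumes you already have a listing of the storage paths and the current schema_id from some other source; the IDs "aaaa-1111" and "bbbb-2222" and the trailing "/tables/t1" segment are made up for the example.

```python
from typing import Iterable, List

# Marker taken from the folder layout shown in the question.
MARKER = "/_unityStorage/schemas/"


def schema_id_from_path(path: str, marker: str = MARKER) -> str:
    """Return the path segment that directly follows `marker`."""
    _, found, rest = path.partition(marker)
    rest = rest.strip("/")
    if not found or not rest:
        raise ValueError(f"no schema_id segment in {path!r}")
    return rest.split("/", 1)[0]


def stale_schema_folders(paths: Iterable[str], current_schema_id: str) -> List[str]:
    """List schema_id folders that do not match the current schema_id."""
    stale: List[str] = []
    for path in paths:
        schema_id = schema_id_from_path(path)
        if schema_id != current_schema_id and schema_id not in stale:
            stale.append(schema_id)
    return stale


root = "abfss://schemas@<storage_account>.dfs.core.windows.net"
folders = [
    f"{root}/_unityStorage/schemas/aaaa-1111",            # hypothetical old ID
    f"{root}/_unityStorage/schemas/bbbb-2222/tables/t1",  # hypothetical current ID
]

print(stale_schema_folders(folders, "bbbb-2222"))  # ['aaaa-1111']
```

Any ID reported as stale is a candidate for a schema that was dropped and re-created; confirm against the audit trail before deleting anything from storage.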

Hope this helps with your understanding.

Cheers, Louis.
