
Regarding - Managed vs External volumes and tables

APJESK
Contributor


From a creation perspective, the steps for managed and external volumes appear almost identical:

  1. Both require a storage credential
  2. Both require an external location
  3. Both point to customer-owned S3

So what exactly makes a volume "managed" vs "external"?

Why is it said that managed volumes are controlled by Databricks, while external volumes are not, when:

  - Both physically live in customer S3
  - Both are accessed using customer-defined IAM roles?

 

Accepted Solutions

SteveOstrowski
Databricks Employee

Hi @APJESK,

You are right that the setup steps look similar on the surface, but the differences between managed and external volumes (and tables) are meaningful once you understand what Unity Catalog does with the data after creation.

WHAT "MANAGED" ACTUALLY MEANS

The word "managed" refers to lifecycle management, not physical location. In both cases the data lives in customer-owned cloud storage. The distinction is about who controls the directory structure and what happens when you drop the object.

MANAGED VOLUMES AND TABLES

- Unity Catalog chooses the storage path automatically. The data is placed under the managed storage location configured at the schema, catalog, or metastore level (in that priority order). The path includes a hashed subdirectory like __unitystorage/schemas/<UUID> that Unity Catalog controls.
- You do NOT specify a LOCATION when creating them.
- When you DROP a managed volume or table, Databricks deletes the underlying data files from cloud storage within a retention window (7 days for volumes, 8 days for tables). A dropped managed table can be restored with UNDROP TABLE while its recovery window is open (7 days by default).
- Predictive optimization (automatic OPTIMIZE, VACUUM, ANALYZE) is available for managed tables.

Creation example for a managed volume:

CREATE VOLUME my_catalog.my_schema.my_managed_volume
    COMMENT 'Managed volume, no LOCATION needed';
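To see the path Unity Catalog picked for the managed volume above, you can describe it. This is a minimal sketch that reuses the example volume name; as far as I know the output includes the storage location and the volume type, but the exact columns can vary by release.

```sql
-- Inspect the volume created above. The result should show the storage
-- location chosen by Unity Catalog and whether the volume is MANAGED or EXTERNAL.
DESCRIBE VOLUME my_catalog.my_schema.my_managed_volume;
```

For a managed volume, the location should sit under the managed storage location of the schema, catalog, or metastore rather than under a path you typed.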

Creation example for a managed table:

CREATE TABLE my_catalog.my_schema.my_managed_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
);
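The DROP behavior described for managed tables can be sketched as follows. It reuses the example table name and assumes you have the privileges needed to drop and restore it, and that the recovery window has not yet passed.

```sql
-- Drop the managed table. Its data files become eligible for deletion
-- once the retention window passes.
DROP TABLE my_catalog.my_schema.my_managed_table;

-- List dropped tables in the schema that can still be restored.
SHOW TABLES DROPPED IN my_catalog.my_schema;

-- Restore the table while the recovery window is still open.
UNDROP TABLE my_catalog.my_schema.my_managed_table;
```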

EXTERNAL VOLUMES AND TABLES

- You specify an existing cloud storage path using the LOCATION clause.
- That path must fall within a registered external location (which is backed by a storage credential).
- When you DROP an external volume or table, Unity Catalog removes only the metadata registration. The data files remain in cloud storage untouched. You must manually delete them if needed.
- External tools and systems can read/write to the same cloud path directly, outside of Databricks.

Creation example for an external volume:

CREATE EXTERNAL VOLUME my_catalog.my_schema.my_external_volume
    COMMENT 'Points to existing S3 path'
    LOCATION 's3://my-bucket/my-existing-data';

Creation example for an external table:

CREATE TABLE my_catalog.my_schema.my_external_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
)
LOCATION 's3://my-bucket/my-table-data';
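Because dropping an external table removes only the registration, the same path can be registered again later. A minimal sketch, assuming the path from the example above holds a Delta table so the schema can be read from the existing transaction log:

```sql
-- Remove only the Unity Catalog registration. The files under the
-- LOCATION stay in S3.
DROP TABLE my_catalog.my_schema.my_external_table;

-- Register the same path again. No column list is given here because
-- this sketch assumes a Delta table already exists at that path.
CREATE TABLE my_catalog.my_schema.my_external_table
USING DELTA
LOCATION 's3://my-bucket/my-table-data';
```

Doing the same with a managed table is not possible after the retention window, because the files themselves are gone.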

WHY BOTH NEED STORAGE CREDENTIALS AND EXTERNAL LOCATIONS

This is the part that causes the confusion you described. Even for managed storage, Unity Catalog needs a storage credential (IAM role) to access the S3 bucket, and the managed storage location must be contained within an external location. The difference is:

- For managed objects, you configure the managed storage location once at the metastore, catalog, or schema level. After that, individual CREATE statements do not reference any path.
- For external objects, you explicitly provide the LOCATION on every CREATE statement, pointing to a path within a registered external location.

So the storage credential and external location setup is shared infrastructure, but managed objects let Unity Catalog decide the exact path and handle the full data lifecycle, while external objects let you point to a path you control.
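The shared setup can be sketched like this. The names my_external_location and my_storage_credential and the managed/ prefixes are hypothetical, the storage credential is assumed to exist already, and the managed paths are chosen so they do not overlap with the external paths used in the examples above.

```sql
-- Shared infrastructure: one external location backed by a storage credential.
-- my_external_location and my_storage_credential are hypothetical names.
CREATE EXTERNAL LOCATION IF NOT EXISTS my_external_location
    URL 's3://my-bucket/'
    WITH (STORAGE CREDENTIAL my_storage_credential)
    COMMENT 'Covers both managed and external paths in this bucket';

-- Managed objects: set the managed storage location once, at the catalog
-- or schema level. Later CREATE statements do not mention a path.
CREATE CATALOG IF NOT EXISTS my_catalog
    MANAGED LOCATION 's3://my-bucket/managed/my_catalog';

CREATE SCHEMA IF NOT EXISTS my_catalog.my_schema
    MANAGED LOCATION 's3://my-bucket/managed/my_schema';
```

External objects, by contrast, keep naming their own path with LOCATION on each CREATE statement, as in the external volume and table examples earlier.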

QUICK COMPARISON

Managed:
- Storage path: Chosen by Unity Catalog automatically
- DROP behavior: Data deleted after retention period
- LOCATION clause: Not used
- Best for: New data, Databricks-only workloads, simplest governance

External:
- Storage path: Customer-specified existing path
- DROP behavior: Data remains in cloud storage
- LOCATION clause: Required
- Best for: Existing data, multi-system access, data that must outlive the Databricks registration
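If you want to check which objects in a schema are managed and which are external, the catalog's information schema is one place to look. This is a sketch based on my understanding of the information_schema views; treat the column names as assumptions and confirm them against your workspace.

```sql
-- Tables in the schema with their type (for example MANAGED or EXTERNAL).
SELECT table_name, table_type
FROM my_catalog.information_schema.tables
WHERE table_schema = 'my_schema';

-- Volumes in the schema with their type and storage location.
SELECT volume_name, volume_type, storage_location
FROM my_catalog.information_schema.volumes
WHERE volume_schema = 'my_schema';
```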

DOCUMENTATION REFERENCES

- Volumes overview: https://docs.databricks.com/aws/en/volumes
- Managed tables: https://docs.databricks.com/aws/en/tables/managed
- External tables: https://docs.databricks.com/aws/en/tables/external
- Managed storage locations: https://docs.databricks.com/aws/en/connect/unity-catalog/managed-storage

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.


REPLIES

themahesh
New Contributor II

Managed and external volumes may look the same because both store data in the customer's S3 and use customer IAM roles. However, the real difference is who controls the data folder.

With a managed volume, Databricks creates the folder in S3 and controls it. If the volume is deleted, Databricks also deletes the data.

With an external volume, the folder already belongs to the customer. Databricks can read and write to it, but if the volume is deleted, the data stays in S3.

Simply: the data lives in the same place, but the ownership is different. Managed volumes are controlled by Databricks; external volumes are controlled by the customer.
