Unity Catalog Migration: External AWS S3 Location Tables vs. Managed Tables in Databricks!

Mantsama4
Contributor III

Hey Databricks enthusiasts!

Migrating to Unity Catalog? Understanding the difference between External S3 Location Tables and Managed Tables is crucial for optimizing governance, security, and cost efficiency.

🔹External S3 Location Tables

✔️Data remains in an existing S3 bucket, with Databricks referencing it externally.
✔️Unity Catalog tracks metadata, but does not control the data lifecycle.
✔️Ideal for multi-platform access or when organizations prefer to manage storage independently.
Challenges: Unity Catalog governs access but not the data lifecycle (dropping the table leaves the files in S3), and features limited to managed tables, such as Predictive Optimization, are unavailable.

🔹Managed Tables

✔️Data is fully managed by Databricks and stored in the managed storage location configured for the metastore, catalog, or schema.
✔️Unity Catalog controls both metadata and the physical data, ensuring strong governance, security, and lineage tracking.
✔️Best suited for AI/ML workloads, compliance-driven use cases, and automated data lifecycle management.
Considerations: Requires migrating data into Databricks-managed storage, impacting existing workflows.
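
To make the difference concrete, here is a minimal sketch of how the two table types are declared. It only builds the SQL text: a `LOCATION` clause makes the table external, and leaving it out makes it managed. The helper `create_table_sql`, the catalog/schema/table names, and the bucket path are all made up for illustration, and the external path would need to sit under an external location already registered in Unity Catalog.

```python
def create_table_sql(full_name, columns, location=None):
    """Build a Unity Catalog CREATE TABLE statement (hypothetical helper).

    With `location` the table is external (data stays at that path);
    without it the table is managed (Unity Catalog picks the storage path).
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE IF NOT EXISTS {full_name} ({cols}) USING DELTA"
    if location:
        sql += f" LOCATION '{location}'"
    return sql


# Illustrative names only; none of these come from a real workspace.
columns = [("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)")]

external_sql = create_table_sql(
    "main.bronze.orders_ext", columns, "s3://example-bucket/bronze/orders"
)
managed_sql = create_table_sql("main.gold.orders", columns)

print(external_sql)
print(managed_sql)
```

In a Databricks notebook you would pass either string to `spark.sql(...)`. Dropping the external table later removes only the metadata, while dropping the managed one also removes the underlying data.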

Which approach works best for your use case? Let’s discuss the trade-offs and strategies for a seamless Unity Catalog migration.

Mantu S
Accepted Solutions

MariuszK
Contributor III

There are two use cases where it's worth using external tables:

  • Bronze layer: when an external tool ingests data into tables through the file system.
  • Integration with external services that can't integrate with UC and need to read files directly from storage.

In other cases it's better to use managed tables, especially when you want Databricks to automate maintenance features on them, such as Liquid Clustering.
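
As a rough sketch, the rule of thumb above could be written down as a small helper to run over a table inventory during migration planning. The `TableProfile` fields and the function name are hypothetical, and the logic covers only the two external-table cases listed here, nothing official.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    """Hypothetical description of one table in a migration inventory."""
    name: str
    ingested_by_external_tool: bool = False  # files written by a non-Databricks tool
    read_by_non_uc_services: bool = False    # consumers that read files straight from storage


def recommend_table_type(table: TableProfile) -> str:
    """Return 'external' for the two cases above, otherwise 'managed'."""
    if table.ingested_by_external_tool or table.read_by_non_uc_services:
        return "external"
    return "managed"


# Illustrative inventory; the table names are made up.
inventory = [
    TableProfile("bronze.raw_events", ingested_by_external_tool=True),
    TableProfile("silver.shared_export", read_by_non_uc_services=True),
    TableProfile("gold.sales_report"),
]

for table in inventory:
    print(f"{table.name}: {recommend_table_type(table)}")
```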

Hi MariuszK,

I appreciate your note. We discussed this with a few of our in-house Databricks architects as well as an architect from Databricks. Based on their recommendations, frequently accessed tables (such as Gold layer tables for reporting, tables used by ML jobs, and real-time streaming tables) should be created as managed tables. This approach gives better performance and optimization, plus enhanced governance and security controls, including support for serverless jobs. Thanks.

Mantu S

Isi
New Contributor III

Hey!

I hope I’m not too late, and I’d like to share my opinion. While it’s true that managed services offer certain advantages over external tables, keep in mind that Databricks services such as Predictive Optimization often come with an associated cost. I recommend reviewing your workflow and checking the associated costs here: Databricks Pricing.

It’s important to note that Databricks operates on a pay-as-you-go model. In most cases, though, keeping control over the service and managing resources through your cloud provider (for example, horizontal autoscaling and cluster size adjustments) results in a lower overall bill. I recommend conducting a cost analysis to determine which processes would benefit from migrating to managed services and which might not be worth it.
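
A cost analysis like this can start as a simple side-by-side estimate. The sketch below assumes the bill has two parts, Databricks usage (DBUs times a rate) and whatever the cloud provider charges separately. Every figure is a placeholder, not a real price, so plug in numbers from your own billing data and the pricing page.

```python
def monthly_cost(dbus: float, price_per_dbu: float, cloud_infra_cost: float = 0.0) -> float:
    """Estimated monthly bill: Databricks usage plus separately billed cloud charges."""
    return dbus * price_per_dbu + cloud_infra_cost


# Placeholder figures only; replace them with your own measurements and rates.
self_managed = monthly_cost(dbus=1000, price_per_dbu=0.5, cloud_infra_cost=200.0)
managed_service = monthly_cost(dbus=1000, price_per_dbu=0.75)

print(f"self-managed compute: {self_managed:.2f}")
print(f"managed service:      {managed_service:.2f}")
print(f"difference:           {managed_service - self_managed:+.2f}")
```

Running it per workload, rather than once for the whole platform, makes it easier to see which processes are worth moving.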

In general, managed services provide better performance, optimization, and enhanced governance and security controls, including support for serverless jobs, but all of this comes at a much higher cost.

🙂

Mantsama4
Contributor III

Thank you for sharing your insights! You make a great point about the cost considerations associated with managed services in Databricks. While managed tables offer advantages in terms of performance, optimization, governance, and security, it’s always important to evaluate cost implications based on specific workloads.

A cost-benefit analysis can help determine which processes truly benefit from managed services versus those that can be optimized through cloud provider resource management (e.g., horizontal autoscaling, cluster size adjustments). We’ll take your feedback into account and ensure the right balance between cost efficiency and operational benefits.

Appreciate your input!

Mantu S
