
Confusion about Data storage: Data Asset within Databricks vs Hive Metastore vs Delta Lake vs Lakehouse vs DBFS vs Unity Catalog vs Azure Blob

Oliver_Angelil
Valued Contributor II

Hi there

It seems there are many different ways to store / manage data in Databricks.

This is the Data asset in Databricks:

[Screenshot 2023-05-09 at 17.02.04]

However data can also be stored (hyperlinks included to relevant pages):

How can a new user make sense of all of these options and know how to proceed? What is the Data asset in Databricks (see screenshot), and how does it relate to the other options I have listed? I am looking for the recommended way to store and access data.

Thank you very much in advance.

ACCEPTED SOLUTION

@Oliver Angelil One more thing to consider before opting for managed or external tables is your gold layer consumers. If you have a BI environment, most things will be external, depending on need. We also need to look at the types of tables present in your existing Hive metastore, which is legacy.


8 REPLIES

karthik_p
Esteemed Contributor

@Oliver Angelil Usually, data in Databricks is stored in either managed or external tables.

For managed tables, Databricks manages both the metadata and the data: the table metadata is registered in hive_metastore and the data files are stored in DBFS, which is part of the root storage configured when Databricks was set up.

For external tables, the table metadata is stored in hive_metastore and the data is stored in external storage (S3, Azure Blob, GCS) that you mount.
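
To make the managed vs external distinction concrete, here is a minimal sketch that only builds the SQL text. It assumes Delta tables and the usual Spark SQL rule that a table created with a LOCATION clause is external and one created without it is managed. The table names, columns, and mount path are hypothetical.

```python
from typing import Optional


def create_table_sql(table: str, columns: str, location: Optional[str] = None) -> str:
    """Build a CREATE TABLE statement for a Delta table.

    No location   -> managed table (Databricks decides where the files live).
    With location -> external table (files live at the path you supply).
    """
    sql = f"CREATE TABLE {table} ({columns}) USING DELTA"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql


# Hypothetical names and path, for illustration only.
managed_sql = create_table_sql("sales_managed", "id INT, amount DOUBLE")
external_sql = create_table_sql(
    "sales_external", "id INT, amount DOUBLE", location="/mnt/mydata/sales"
)

print(managed_sql)
print(external_sql)
```

In a notebook you would pass each string to spark.sql(...). Dropping the managed table also deletes its data files, while dropping the external table leaves the files at the location in place.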

Now that governance has come into the picture with Unity Catalog, DBFS is no longer recommended for security reasons.

You can create your own metastore and link it to the Databricks account --> link it to the Databricks workspace --> catalog --> table data and metadata

If you already have data in the Hive metastore, you can migrate it to Unity Catalog (your own newly created metastore).

There are a few limitations in UC. If you prefer to go with UC, which is the best option, Databricks recommends managed tables.
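
As a rough sketch of what such a migration can look like, the helper below builds a CREATE TABLE ... AS SELECT statement that copies a table from the legacy hive_metastore catalog into a Unity Catalog catalog using three-level names (catalog.schema.table). This is only one possible approach, not necessarily the one meant above, and the catalog name main and the table name sales are assumptions for illustration.

```python
from typing import Optional


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Three-level name used once Unity Catalog is enabled."""
    return f"{catalog}.{schema}.{table}"


def migrate_to_uc_sql(
    schema: str,
    table: str,
    target_catalog: str,
    target_schema: Optional[str] = None,
) -> str:
    """Build a CTAS statement that copies a legacy table into Unity Catalog."""
    source = qualified_name("hive_metastore", schema, table)
    target = qualified_name(target_catalog, target_schema or schema, table)
    return f"CREATE TABLE {target} AS SELECT * FROM {source}"


# "main" and "sales" are hypothetical names.
migration_sql = migrate_to_uc_sql("default", "sales", target_catalog="main")
print(migration_sql)
```

Because no LOCATION is given, the new table is a managed table in the target catalog. The statement copies the data, so the original table in hive_metastore is left untouched.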

Thanks @karthik p​ 

So the recommended approach would be Unity Catalog with managed tables?

I.e. to not use hive_metastore (DBFS)?

karthik_p
Esteemed Contributor

@Oliver Angelil Once you enable Unity Catalog, you can use the newly created Unity Catalog metastore; there is no need for the DBFS hive_metastore, which is legacy. Databricks also recommends using external locations rather than mounts, as mounts are not very secure.

Thanks @karthik p - what do you mean by external locations? For example I see that one of my names in hive_metastore is EXTERNAL (it is coming from dbfs). What is an external table and what would an internal table be?

[image]

karthik_p
Esteemed Contributor

@Oliver Angelil There is no concept called an internal table. We have two types: 1. managed 2. external. In legacy, if you provided an external mount location, the table was an external table. From now on, when you use Unity Catalog, the table type will still be external, but the place where the external table's data is stored is called an external location. In Azure that should be ADLS Gen2, and the type of configuration you perform is different from legacy.
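
To check which of the two types a given table is, one option is to read the Type row from DESCRIBE TABLE EXTENDED. The sketch below works on plain (col_name, data_type) pairs so it can be tried without a cluster. It assumes the output contains a "# Detailed Table Information" marker followed by a "Type" row; the sample rows are made up.

```python
from typing import Iterable, Optional, Tuple

DETAILS_MARKER = "# Detailed Table Information"


def table_type(describe_rows: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the table type (e.g. MANAGED or EXTERNAL), or None if not found.

    Only rows after the details marker are inspected, so a data column that
    happens to be called "type" is not mistaken for the table type.
    """
    in_details = False
    for col_name, value in describe_rows:
        if col_name.strip() == DETAILS_MARKER:
            in_details = True
        elif in_details and col_name.strip().lower() == "type":
            return value.strip().upper()
    return None


# Made-up rows in the shape of DESCRIBE TABLE EXTENDED output.
sample_rows = [
    ("id", "int"),
    ("amount", "double"),
    ("", ""),
    (DETAILS_MARKER, ""),
    ("Type", "EXTERNAL"),
    ("Location", "dbfs:/mnt/mydata/sales"),
]
print(table_type(sample_rows))
```

In a notebook the rows could come from something like [(r["col_name"], r["data_type"]) for r in spark.sql("DESCRIBE TABLE EXTENDED my_table").collect()], where my_table is a placeholder.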

legacy:

  1. You mount the storage using an access key
  2. You can see the mounted data under /dbfs

unity catalog:

  1. Mounts are not recommended
  2. Within the Data screen --> you can add an external location and the credential related to it
  3. Whoever has access to that external location can access the data present in it
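
The Unity Catalog steps above can also be expressed in SQL instead of the Data screen. The sketch below only builds the statements. It assumes a storage credential already exists, and every name in it (container, storage account, location, credential, group) is hypothetical.

```python
def abfss_url(container: str, storage_account: str, path: str = "") -> str:
    """Build an ADLS Gen2 URL of the form abfss://container@account.dfs.core.windows.net/path."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base


def create_external_location_sql(name: str, url: str, credential: str) -> str:
    """Register a storage path as a Unity Catalog external location."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' WITH (STORAGE CREDENTIAL {credential})"
    )


def grant_read_sql(name: str, principal: str) -> str:
    """Let a user or group read files under the external location."""
    return f"GRANT READ FILES ON EXTERNAL LOCATION {name} TO `{principal}`"


# Hypothetical names, for illustration only.
url = abfss_url("raw", "mystorageaccount", "sales")
location_sql = create_external_location_sql("sales_raw", url, "my_credential")
grant_sql = grant_read_sql("sales_raw", "data-engineers")

print(location_sql)
print(grant_sql)
```

Unlike a mount, which is visible to everyone in the workspace, access here is controlled per external location through grants.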

@karthik p​ I believe the recommended option would be managed tables, not external? See at 17:00 minutes in this video: https://youtu.be/ibvG-pYKl8U?t=1021

@Oliver Angelil Yes, the recommendation is managed tables, because whatever new features are released for tables will be supported for managed tables, whereas external tables have a few limitations. The notes shared above are for information purposes.

