Data Engineering

Confusion about data storage: Data asset within Databricks vs Hive Metastore vs Delta Lake vs Lakehouse vs DBFS vs Unity Catalog vs Azure Blob

Oliver_Angelil
Valued Contributor II

Hi there

It seems there are many different ways to store / manage data in Databricks.

This is the Data asset in Databricks:

[Screenshot: the Data asset in Databricks]

However data can also be stored (hyperlinks included to relevant pages):

How can a new user make sense of all of these options and know how to proceed? What is the Data asset in Databricks (see screenshot), and how does it relate to the other options I have listed? I am looking for the recommended way to store and access data.

Thank you very much in advance.

1 ACCEPTED SOLUTION

@Oliver Angelil One more thing to consider before choosing between managed and external tables is who consumes your gold layer: if you have a BI environment, most tables will be external, depending on need. You also need to check which types of tables exist in your current Hive metastore, which is legacy.

9 REPLIES

karthik_p
Esteemed Contributor

@Oliver Angelil Data in Databricks is usually stored in either managed or external tables.

Managed tables keep the table metadata and the data together under hive_metastore (DBFS), which is part of the root storage configured when your Databricks workspace was set up.

For external tables, on the other hand, the table metadata is stored in hive_metastore and the data is stored in external storage (S3, Azure Blob, GCS, or any other external storage) that you mount.
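To make the distinction concrete, here is a minimal sketch that builds the two kinds of CREATE TABLE statement. The table names, columns, and mount path are hypothetical, and the helper `create_table_sql` is only for illustration. The one difference that matters is the LOCATION clause: without it the table is managed, with it the table is external.

```python
# Sketch only: table names, columns and the mount path below are hypothetical.

def create_table_sql(table, columns, location=None):
    """Build a CREATE TABLE statement.

    Without a LOCATION clause the table is managed; with one it is external.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE IF NOT EXISTS {table} ({cols}) USING DELTA"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql


columns = [("id", "BIGINT"), ("amount", "DOUBLE")]

# Managed: Databricks decides where the data files live.
managed_sql = create_table_sql("sales_managed", columns)

# External: the data files live at a path you choose (here a hypothetical mount).
external_sql = create_table_sql(
    "sales_external", columns, location="/mnt/my_mount/sales"
)

print(managed_sql)
print(external_sql)

# In a Databricks notebook, where `spark` is predefined, these could be run with:
# spark.sql(managed_sql)
# spark.sql(external_sql)
```

One practical consequence: dropping a managed table also deletes its data files, while dropping an external table removes only the metadata and leaves the files in your storage.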

Now that governance has come into the picture with Unity Catalog, DBFS is not recommended, for security reasons.

With Unity Catalog you create your own metastore and link it to your Databricks account --> link it to a Databricks workspace --> catalog --> table data and metadata.

If you already have data in the Hive metastore, you can migrate it into Unity Catalog (that is, into your newly created metastore).

There are a few limitations in UC. If you prefer to go with UC, which is the best option --> Databricks recommends managed tables.
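As a rough illustration of the hierarchy and the migration step described above, the sketch below builds three-level names (catalog.schema.table) and a simple copy from hive_metastore into a Unity Catalog catalog using CREATE TABLE ... AS SELECT. The catalog, schema, and table names are hypothetical, and copying with CTAS is just one possible approach that I am assuming here for simplicity; it creates a managed table and copies the data.

```python
# Sketch only: "my_catalog", "sales_db" and "orders" are hypothetical names.

def full_name(catalog, schema, table):
    """Three-level name used by Unity Catalog: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def migration_statements(target_catalog, schema, tables):
    """Build SQL that copies tables from hive_metastore into a UC catalog."""
    statements = [
        f"CREATE CATALOG IF NOT EXISTS {target_catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {target_catalog}.{schema}",
    ]
    for table in tables:
        source = full_name("hive_metastore", schema, table)
        target = full_name(target_catalog, schema, table)
        # No LOCATION clause, so the new table is a managed table.
        statements.append(f"CREATE TABLE {target} AS SELECT * FROM {source}")
    return statements


statements = migration_statements("my_catalog", "sales_db", ["orders"])
for stmt in statements:
    print(stmt)

# In a Databricks notebook with Unity Catalog enabled, each statement could be
# run with: spark.sql(stmt)
```

Legacy tables stay addressable as hive_metastore.<schema>.<table> after Unity Catalog is enabled, which is what makes a side-by-side copy like this possible.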

Thanks @karthik p

So the recommended approach would be Unity Catalog with managed tables?

I.e. to not use hive_metastore (DBFS)?

karthik_p
Esteemed Contributor

@Oliver Angelil Once you enable Unity Catalog you can use the newly created Unity Catalog metastore; there is no need for the DBFS hive_metastore, which is legacy. Databricks also recommends using external locations rather than mounts, as mounts are not very secure.

Thanks @karthik p - what do you mean by external locations? For example, I see that one of my tables in hive_metastore is EXTERNAL (it is coming from DBFS). What is an external table, and what would an internal table be?
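For anyone wanting to check the table type mentioned above, one way is to read the "Type" row from the output of DESCRIBE TABLE EXTENDED, which returns rows of (col_name, data_type, comment). The sketch below parses rows of that shape; the sample rows and the table name are made up for illustration, and the helper `table_type` is hypothetical.

```python
# Sketch only: sample_rows imitates DESCRIBE TABLE EXTENDED output for a
# hypothetical table; real output contains more rows.

def table_type(describe_rows):
    """Return the 'Type' value (e.g. MANAGED or EXTERNAL), or None if absent.

    Only rows after the '# Detailed Table Information' marker are considered,
    so a data column that happens to be called 'Type' is not picked up.
    """
    in_details = False
    for col_name, data_type, _comment in describe_rows:
        if col_name.strip() == "# Detailed Table Information":
            in_details = True
        elif in_details and col_name.strip() == "Type":
            return data_type.strip()
    return None


sample_rows = [
    ("id", "bigint", None),
    ("", "", ""),
    ("# Detailed Table Information", "", ""),
    ("Name", "hive_metastore.default.sales", ""),
    ("Type", "EXTERNAL", ""),
    ("Location", "dbfs:/mnt/my_mount/sales", ""),
]

print(table_type(sample_rows))

# In a Databricks notebook, where `spark` is predefined, real rows could come from:
# rows = spark.sql("DESCRIBE TABLE EXTENDED hive_metastore.default.sales").collect()
# table_type(rows)
```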

karthik_p
Esteemed Contributor

@Oliver Angelil There is no concept called an internal table. There are two types: 1. managed 2. external. In legacy, if you provided an external mount location, the table was an external table. With Unity Catalog the table type is still external, but the place where an external table's data is stored is called an external location. In Azure that should be ADLS Gen2, and the type of configuration you perform is different from legacy.

Legacy:

  1. You mount the storage using an access key
  2. You can see it under /dbfs

Unity Catalog:

  1. Mounts are not recommended
  2. In the Data screen --> you can add an external location and the credential related to it
  3. Whoever has access to that external location can access the data in it
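The Unity Catalog steps above can also be expressed in SQL. The sketch below builds the statements for registering an external location on ADLS Gen2 and granting read access to it. All names here (container, storage account, location, credential, group) are hypothetical, and the sketch assumes the storage credential has already been created.

```python
# Sketch only: every name below is hypothetical, and the storage credential
# "my_credential" is assumed to exist already.

def abfss_url(container, storage_account, path=""):
    """Build an ADLS Gen2 URL of the form abfss://container@account.dfs.core.windows.net/path."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{path.lstrip('/')}"
    )


def external_location_sql(name, url, credential):
    """SQL that registers an external location backed by a storage credential."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' WITH (STORAGE CREDENTIAL {credential})"
    )


def grant_read_sql(location, principal):
    """SQL that lets a user or group read files under the external location."""
    return f"GRANT READ FILES ON EXTERNAL LOCATION {location} TO `{principal}`"


url = abfss_url("raw", "mystorageacct", "sales")
location_sql = external_location_sql("sales_location", url, "my_credential")
grant_sql = grant_read_sql("sales_location", "data_engineers")

print(location_sql)
print(grant_sql)

# In a Databricks notebook with Unity Catalog enabled, these could be run with:
# spark.sql(location_sql)
# spark.sql(grant_sql)
```

Access is then controlled by grants on the external location rather than by an access key baked into a mount, which is the security difference between the two lists above.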

@karthik p I believe the recommended option would be managed tables, not external? See 17:00 in this video: https://youtu.be/ibvG-pYKl8U?t=1021

@Oliver Angelil Yes, managed is recommended, because whatever new features are released for tables will be supported for managed tables, whereas external tables have a few limitations. The notes shared above are for information purposes.

@Oliver Angelil One more thing to consider before choosing between managed and external tables is who consumes your gold layer: if you have a BI environment, most tables will be external, depending on need. You also need to check which types of tables exist in your current Hive metastore, which is legacy.

Rahul_S
New Contributor II

Informative.
