05-09-2023 08:21 AM
Hi there
It seems there are many different ways to store / manage data in Databricks.
This is the Data asset in Databricks:
However data can also be stored (hyperlinks included to relevant pages):
How can a new user make sense of all of these options and know how to proceed? What is the Data asset in Databricks (see screenshot), and how does it relate to the other options I have listed? I'm looking for the recommended approach to data storage and access.
Thank you very much in advance.
05-09-2023 09:04 AM
@Oliver Angelil Data in Databricks is usually stored in either managed or external tables.
For managed tables, both the table metadata and the data are kept under hive_metastore, with the data in DBFS, which is part of the root storage configured when the Databricks workspace was set up.
For external tables, the table metadata is stored in hive_metastore and the data is stored in external storage (S3, Azure Blob, GCS, etc.) that you mount.
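As a rough sketch of the difference (the table names, columns, and bucket path below are made up for illustration), the SQL for the two table types differs only in whether a `LOCATION` is given. The helper just builds the statement as a string; on a cluster you would pass it to `spark.sql(...)`.

```python
from typing import List, Optional, Tuple


def create_table_sql(
    full_name: str,
    columns: List[Tuple[str, str]],
    location: Optional[str] = None,
) -> str:
    """Build a CREATE TABLE statement.

    Without a location the table is managed; with a LOCATION clause
    the data lives at that path and the table is external.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    sql = f"CREATE TABLE IF NOT EXISTS {full_name} ({cols})"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql


# Hypothetical schema and storage path, for illustration only.
columns = [("id", "BIGINT"), ("amount", "DOUBLE")]
managed = create_table_sql("sales.orders", columns)
external = create_table_sql(
    "sales.orders_ext", columns, location="s3://example-bucket/orders"
)

print(managed)
print(external)
# On a Databricks cluster: spark.sql(managed); spark.sql(external)
```

Dropping a managed table removes its data as well, while dropping an external table leaves the files at the external path in place.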
Now that governance has come into the picture with Unity Catalog, DBFS is not recommended for security reasons.
You can create your own metastore and link it as follows: Databricks account --> Databricks workspace --> catalog --> table data and metadata.
If you already have data in the Hive metastore, you can migrate it into Unity Catalog (your newly created metastore).
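One simple way to copy a table across is a `CREATE TABLE ... AS SELECT` from the `hive_metastore` catalog into a Unity Catalog catalog. This is only a sketch: the catalog, schema, and table names are placeholders, and it assumes the target catalog and schema already exist. Unity Catalog tables are addressed as `catalog.schema.table`.

```python
def migrate_table_sql(schema: str, table: str, target_catalog: str) -> str:
    """Build a CTAS statement that copies a legacy Hive metastore table
    into a Unity Catalog catalog under the same schema and table name."""
    source = f"hive_metastore.{schema}.{table}"
    target = f"{target_catalog}.{schema}.{table}"
    return f"CREATE TABLE {target} AS SELECT * FROM {source}"


# Placeholder names; replace with your own catalog, schema, and table.
stmt = migrate_table_sql("sales", "orders", "main")
print(stmt)
# On a Databricks cluster: spark.sql(stmt)
```

This copies the data into a new managed table, so the original in `hive_metastore` stays untouched until you drop it yourself.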
There are a few limitations in UC. If you prefer to go with UC, which is the best option, Databricks recommends managed tables.
05-09-2023 10:41 AM
Thanks @karthik p
So the recommended approach would be Unity Catalog with managed tables?
I.e. to not use hive_metastore (DBFS)?
05-09-2023 11:52 AM
@Oliver Angelil Once you enable Unity Catalog, you can use the newly created Unity Catalog metastore; there is no need for the DBFS hive_metastore, which is legacy. Databricks also recommends using external locations rather than mounts, as mounts are less secure.
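For reference, an external location is registered in Unity Catalog with a `CREATE EXTERNAL LOCATION` statement that ties a cloud storage URL to a storage credential. A minimal sketch, where the location name, URL, and credential name are all hypothetical and the storage credential is assumed to exist already:

```python
def external_location_sql(name: str, url: str, credential: str) -> str:
    """Build a CREATE EXTERNAL LOCATION statement for Unity Catalog."""
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {credential})"
    )


# Hypothetical storage URL and credential name, for illustration only.
url = "abfss://data@examplestorage.dfs.core.windows.net/raw"
location_stmt = external_location_sql("raw_zone", url, "example_credential")
print(location_stmt)
# On a Databricks cluster: spark.sql(location_stmt)
```

External tables then point at paths under that URL directly, instead of at a `/mnt/...` mount path.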
05-10-2023 09:11 AM
05-11-2023 12:29 PM
@Oliver Angelil There is no concept called an internal table; there are two types: 1. managed, 2. external. In legacy, if you provided an external mount location, the table was an external table. Going forward with Unity Catalog, the table type is still external, but the place where an external table's data is stored is called an external location. In Azure that should be ADLS Gen2, and the configuration you perform is different from legacy.
legacy:
unity catalog:
05-19-2023 01:52 PM
@karthik p I believe the recommended option would be managed tables, not external? See at 17:00 minutes in this video: https://youtu.be/ibvG-pYKl8U?t=1021
05-19-2023 03:13 PM
@Oliver Angelil Yes, managed is recommended, because whatever new features are released for tables will be supported for managed tables, whereas external tables have a few limitations. The notes shared above are for information purposes.
05-19-2023 03:23 PM
@Oliver Angelil One more thing to consider before choosing managed or external tables is your gold-layer consumers: if you have a BI environment, most tables will be external, depending on need. You also need to look at the types of tables present in your existing Hive metastore, which is legacy.
07-14-2024 12:33 AM
Informative.