cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Naming question about SQL server database schemas

dbernstein_tp
New Contributor III

I have an MS SQL server database that has several schemas we need to ingest data from. Call them "SCHEMA1" tables and "SCHEMA2" tables. Let's call the server S and the database D. In unity catalog I have a catalog called "staging" where the staging (bronze) data from all sources ends up in.

In staging I have a schema called "S_D", which is fine. However to distinguish the tables from the two different source schemas I end up calling them "SCHEMA1_TableName", "SCHEMA2_TableName", etc. That is, I use underscores to distinguish the two source schemas.

Is there a better way to handle this? Or is there a best practice? I would rather not have a separate catalog for each staging data source.

1 ACCEPTED SOLUTION

Accepted Solutions

K_Anudeep
Databricks Employee
Databricks Employee

Hello @dbernstein_tp ,

Generally, UC has a three-level naming convention: catalogue.schema.table, and currently, you are using something like staging.S_D.SCHEMA1_TableName

In my opinion, the right way to name it is to use staging.S_D_SCHEMA1.TableName, to represent the source schema instead of including it in the table name and also because it matches the medallion convention <env>_<zone>.<source_system>_<db>_<schema>.<table>

Therefore, as a best practice, I would suggest maintaining a single staging catalogue for bronze, but using one UC schema per source schema (or per source system). Avoid encoding the schema name into the table name when you can encode it into the UC schema level instead.
 

 
Anudeep

View solution in original post

4 REPLIES 4

K_Anudeep
Databricks Employee
Databricks Employee

Hello @dbernstein_tp ,

Generally, UC has a three-level naming convention: catalogue.schema.table, and currently, you are using something like staging.S_D.SCHEMA1_TableName

In my opinion, the right way to name it is to use staging.S_D_SCHEMA1.TableName, to represent the source schema instead of including it in the table name and also because it matches the medallion convention <env>_<zone>.<source_system>_<db>_<schema>.<table>

Therefore, as a best practice, I would suggest maintaining a single staging catalogue for bronze, but using one UC schema per source schema (or per source system). Avoid encoding the schema name into the table name when you can encode it into the UC schema level instead.
 

 
Anudeep

Coffee77
Contributor III

Create indeed a different catalog for environment (DEV, QA, Staging, Prod), new schema/databases inside and then tables, views, volumes or functions following the three names convention <catalog_name>.<schema_name>.<table_name>. Remove, of course, the schema name from the tables or any other child data object. You can query system tables to list all objects per schema and catalog if needed.

 


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

Raman_Unifeye
Contributor III

@dbernstein_tp - I would go with @K_Anudeep  suggestion. This way you would always know upfront and explicit about source database mapping.

However, persoanlly, I will have one catalog per source system database.


RG #Driving Business Outcomes with Data Intelligence

dbernstein_tp
New Contributor III

Thanks for the responses! @K_Anudeep suggestion makes sense in the context of our current lakehouse architecture so I think I will migrate to that.