12-16-2024 12:28 PM - edited 12-16-2024 12:48 PM
I am in the process of rebuilding the data lake at my current company with Databricks, and I'm struggling to find comprehensive best practices for naming conventions and for structuring a medallion architecture that works optimally with the Databricks Assistant.
I've been reading about the Assistant and the sources it uses to determine which fields belong in which table, and so on. Most of the examples I have read show descriptive table names without any prefixes or suffixes. The problem is that I usually organize the medallion layers, as well as other things like residency and ingestion source, using prefixes or suffixes in the table names, for example bronze_marketing_campaign_response_us_cdc. The documentation I am reading makes it seem like this is not going to be very optimal, but I can't seem to find what the 'right' way actually is. Does all of that other information need to live at the catalog or schema level? Is there something I can do in Unity Catalog so the Assistant can interpret the extra information currently encoded in the table names?
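For illustration, here is a sketch, in plain Python, of what any tooling (the Assistant included) would have to do to recover metadata packed into a prefixed/suffixed table name. The vocabulary sets and field order are invented assumptions, which is exactly the problem: nothing in the name itself declares them, which is part of why the docs steer toward putting this information at the catalog/schema level instead.

```python
# Hypothetical parser for prefix/suffix-encoded table names such as
# "bronze_marketing_campaign_response_us_cdc". The vocab sets below are
# assumptions for this sketch, not anything Databricks defines.

KNOWN_LAYERS = {"bronze", "silver", "gold"}
KNOWN_RESIDENCIES = {"us", "eu", "apac"}
KNOWN_INGESTION_MODES = {"cdc", "batch", "stream"}

def parse_table_name(name: str) -> dict:
    """Split an encoded table name into layer / entity / residency / ingestion."""
    parts = name.split("_")
    layer = parts[0] if parts[0] in KNOWN_LAYERS else None
    ingestion = parts[-1] if parts[-1] in KNOWN_INGESTION_MODES else None
    # Everything between the (optional) layer prefix and ingestion suffix:
    core = parts[1 if layer else 0 : len(parts) - (1 if ingestion else 0)]
    residency = None
    if core and core[-1] in KNOWN_RESIDENCIES:
        residency = core[-1]
        core = core[:-1]
    return {
        "layer": layer,
        "entity": "_".join(core),
        "residency": residency,
        "ingestion": ingestion,
    }

# parse_table_name("bronze_marketing_campaign_response_us_cdc")
# -> {"layer": "bronze", "entity": "marketing_campaign_response",
#     "residency": "us", "ingestion": "cdc"}
```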
12-16-2024 12:54 PM
When structuring your data lake with Databricks and implementing the medallion architecture (Bronze, Silver, Gold layers), it is essential to follow best practices for naming conventions and table organization to ensure optimal performance and usability, especially when using Unity Catalog.
Medallion Architecture Overview:
You can consider the following approach:
Naming Conventions:
12-16-2024 12:55 PM
Unity Catalog Setup:
Example Structure:
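One hedged sketch of such a structure (all catalog, schema, and table names here are illustrative, not a prescribed layout): the layer moves to the catalog level, the domain or source moves to the schema level, and the table name is left purely descriptive, addressed through Unity Catalog's three-part `catalog.schema.table` namespace.

```python
# Illustrative three-level Unity Catalog layout (catalog -> schema -> table).
# Names are hypothetical; the point is that layer and domain move out of the
# table name and into the catalog and schema levels.

example_structure = {
    "bronze": {
        "salesforce": ["campaign_response_cdc", "contact_cdc"],
        "ga4": ["events_stream"],
    },
    "silver": {
        "marketing": ["campaign_response"],
        "person": ["contact"],
    },
    "gold": {
        "marketing_analytics": ["campaign_performance"],
    },
}

def fully_qualified(catalog: str, schema: str, table: str) -> str:
    """Build the three-part name Unity Catalog uses to address a table."""
    return f"{catalog}.{schema}.{table}"
```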
02-07-2025 04:13 PM
Hello!
I am in a similar position and the medallion architecture makes a lot of sense to me (indeed, I believe we've been following a version of that ourselves for a long time).
It seems to me that having separate catalogs for each layer (bronze/silver/gold) makes the most sense (per the second reply above). My question is: are there any drawbacks to organizing our data where each catalog is a different layer?
Long-term, I'm thinking we're likely to do our ETL with Delta Live Tables, so I'd be particularly interested to know whether there might be any limitations with DLT pipelines.
a month ago
Our initial approach was to have catalogs for sources and uses, but that was confusing and the grade of data wasn't obvious. And, as we began to join data, it was unclear where to put the final datasets. We switched to three catalogs: one for bronze, one for silver, one for gold. This has made life much less confusing for us, not to mention for the systems that rely on our data.
In bronze we have a schema for each source. The landed data sits in an external volume, and we flatten it into tables in the same schema. Silver has schemas for the different entities we're working with, like "person" or "product". In gold we have schemas for different consumers or departments; there isn't a lot of overlap in data. This way we can restrict end users to specific gold schemas.
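The per-consumer gold schemas described above lend themselves to schema-level grants. A minimal sketch, assuming a `gold` catalog and invented schema and group names (the generated statements use standard Unity Catalog GRANT syntax, but verify the exact privileges against your workspace setup):

```python
# Generate per-consumer schema grants for the gold catalog, following the
# "one gold schema per consumer/department" layout described above.
# Schema names and account groups here are invented examples.

gold_consumers = {
    "finance": "finance_analysts",      # schema -> account group
    "marketing": "marketing_analysts",
}

def gold_grants(consumers: dict) -> list:
    """Emit one schema-level GRANT per consumer so each group can only
    see its own gold schema (USE CATALOG on `gold` is still needed)."""
    return [
        f"GRANT USE SCHEMA, SELECT ON SCHEMA gold.{schema} TO `{group}`;"
        for schema, group in sorted(consumers.items())
    ]
```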
a month ago
That makes perfect sense and dovetails with my thinking, as well. Thank you!