cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Medallion Architecture and Databricks Assistant

SteveC527
New Contributor

I am in the process of rebuilding the data lake at my current company with databricks and I'm struggling to find comprehensive best practices for naming conventions and structuring medallion architecture to work optimally with the Databricks assistant.

I've been reading about the assistant and what sources it uses to determine what fields it should use in what table etc. Most of the examples I have read shows descriptive table names but without any prefixes or suffixes. The problem is I usually just organize medallion architecture, as well as other things like residency and ingestion source, either using prefixes or suffixes in the table names. For example, bronze_marketing_campaign_response_us_cdc . The documentation I am reading makes it seem like this is not going to be very optimal but I can't seem to find what the 'right' way actually is. Does all of the other information need to happen at the catalog or schema level? Is there something I can do in Unity Catalog to set this up so the assistant can interpret the other information in the table names? 

2 REPLIES 2

Alberto_Umana
Databricks Employee
Databricks Employee

When structuring your data lake with Databricks and implementing the medallion architecture (Bronze, Silver, Gold layers), it is essential to follow best practices for naming conventions and table organization to ensure optimal performance and usability, especially when using Unity Catalog.

Medallion Architecture Overview:

  1. Bronze Layer: This layer contains raw data ingested from various sources. The table structures in this layer should mirror the source system structures as closely as possible, including additional metadata columns for load date/time, process ID, etc.
  2. Silver Layer: This layer involves cleansing and conforming the data from the Bronze layer. The focus here is on creating an enterprise view of key business entities and transactions, making the data suitable for self-service analytics and further transformations.
  3. Gold Layer: This layer contains curated, business-level tables optimized for reporting and analytics. The data models here are typically denormalized and read-optimized.

 

 

you can consider the following approach:

  1. Materialized View: Convert the Gold table to a materialized view instead of a streaming table. This approach will allow you to handle updates and deletes in the Silver table without causing errors in the Gold table. Materialized views do not have the append-only restriction that streaming tables do.
  2. Full Refresh: If you expect only a one-time change in the Silver table and subsequent operations will be append-only, you can perform a full refresh of the Gold table. This will clear all data from the Gold table and reload all data from the Silver table, resolving the issue caused by the non-append change.
  3. Change Data Feed (CDF): Utilize the Delta Lake Change Data Feed (CDF) feature to track changes in the Silver table and apply those changes to the Gold table. This approach allows you to capture updates and deletes and propagate them to the Gold table efficiently

 

Naming Conventions:

  • Descriptive Names: Use descriptive names for tables that clearly indicate their purpose and content. Avoid using prefixes or suffixes that might clutter the table names.
  • Layer Indicators: While the documentation suggests avoiding prefixes or suffixes, you can use schema or catalog levels to indicate the layer (e.g., bronze, silver, gold).
  • Hierarchical Structure: Organize your tables within schemas that reflect their layer and purpose. For example:
    • bronze.marketing_campaign_response_us_cdc
    • silver.marketing_campaign_response_us_cdc
    • gold.marketing_campaign_response_us_cdc

Alberto_Umana
Databricks Employee
Databricks Employee

Unity Catalog Setup:

  • Catalog and Schema Levels: Use Unity Catalog to manage and organize your tables. Create separate catalogs or schemas for each layer of the medallion architecture. This way, the assistant can interpret the context based on the catalog or schema rather than relying on table name prefixes or suffixes.
  • Access Control: Unity Catalog provides centralized access control, auditing, and data discovery capabilities. Ensure that you define access policies at the catalog or schema level to manage permissions effectively.

Example Structure:

  • Catalogs:
    • bronze_catalog
    • silver_catalog
    • gold_catalog
  • Schemas within Catalogs:
    • bronze_catalog.marketing
    • silver_catalog.marketing
    • gold_catalog.marketing
  • Tables:
    • bronze_catalog.marketing.campaign_response_us_cdc
    • silver_catalog.marketing.campaign_response_us_cdc
    • gold_catalog.marketing.campaign_response_us_cdc

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group