When structuring your data lake with Databricks and implementing the medallion architecture (Bronze, Silver, Gold layers), it is essential to follow best practices for naming conventions and table organization to ensure optimal performance and usability, especially when using Unity Catalog.
Medallion Architecture Overview:
- Bronze Layer: This layer contains raw data ingested from various sources. The table structures in this layer should mirror the source system structures as closely as possible, including additional metadata columns for load date/time, process ID, etc.
- Silver Layer: This layer involves cleansing and conforming the data from the Bronze layer. The focus here is on creating an enterprise view of key business entities and transactions, making the data suitable for self-service analytics and further transformations.
- Gold Layer: This layer contains curated, business-level tables optimized for reporting and analytics. The data models here are typically denormalized and read-optimized.
you can consider the following approach:
- Materialized View: Convert the Gold table to a materialized view instead of a streaming table. This approach will allow you to handle updates and deletes in the Silver table without causing errors in the Gold table. Materialized views do not have the append-only restriction that streaming tables do.
- Full Refresh: If you expect only a one-time change in the Silver table and subsequent operations will be append-only, you can perform a full refresh of the Gold table. This will clear all data from the Gold table and reload all data from the Silver table, resolving the issue caused by the non-append change.
- Change Data Feed (CDF): Utilize the Delta Lake Change Data Feed (CDF) feature to track changes in the Silver table and apply those changes to the Gold table. This approach allows you to capture updates and deletes and propagate them to the Gold table efficiently
Naming Conventions:
- Descriptive Names: Use descriptive names for tables that clearly indicate their purpose and content. Avoid using prefixes or suffixes that might clutter the table names.
- Layer Indicators: While the documentation suggests avoiding prefixes or suffixes, you can use schema or catalog levels to indicate the layer (e.g., bronze, silver, gold).
- Hierarchical Structure: Organize your tables within schemas that reflect their layer and purpose. For example:
- bronze.marketing_campaign_response_us_cdc
- silver.marketing_campaign_response_us_cdc
- gold.marketing_campaign_response_us_cdc