
Building Data Models on Databricks Platform

rathorer
New Contributor II

This article describes data models and how to build them on the Databricks Platform, focusing on Data Vault and Data Mesh.

Data Vault

What is Data Vault?

Data Vault is a modern data modeling technique designed for agile, scalable, and auditable enterprise data warehouses. It separates core business concepts, their relationships, and descriptive attributes into distinct components.
It is a "write-optimized" modeling style.


  • Key Components (a minimal table sketch follows this list):
    • Hub – Stores unique business keys (e.g., CustomerID)
    • Link – Stores relationships between Hubs (e.g., Customer ↔ Order)
    • Satellite – Stores descriptive attributes and their history (e.g., Customer Name, Status)
    • Point-In-Time (PIT) / Bridge – Pre-joins Hubs with their Satellites (PIT) or several Hubs and Links (Bridge), so that "point in time" filtering and flattened, dimension-like views can be served without repeating the joins at query time
    • Raw Vault – The layer where source data is modeled as Hub, Link and Satellite tables
    • Business Vault – Created by applying the ETL business rules, data quality, cleansing and conforming rules on top of the Raw Vault. It can serve as an enterprise "central repository" of standardized, cleansed data
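
To make this concrete, here is a minimal sketch of a Hub, a Link and a Satellite on Delta. All table and column names (raw_vault.hub_customer, sat_customer, etc.) are illustrative assumptions, not names prescribed by the post or by Databricks.

```sql
-- Hub: one row per unique business key
CREATE TABLE IF NOT EXISTS raw_vault.hub_customer (
  customer_hash_key STRING NOT NULL,   -- e.g. sha1 of the business key
  customer_id       STRING NOT NULL,   -- business key from the source
  load_ts           TIMESTAMP,         -- audit column
  record_source     STRING
) USING DELTA;

-- Link: relationship between two Hubs
CREATE TABLE IF NOT EXISTS raw_vault.link_customer_order (
  link_hash_key     STRING NOT NULL,   -- hash of the combined business keys
  customer_hash_key STRING NOT NULL,
  order_hash_key    STRING NOT NULL,
  load_ts           TIMESTAMP,
  record_source     STRING
) USING DELTA;

-- Satellite: descriptive attributes and their history for a Hub
CREATE TABLE IF NOT EXISTS raw_vault.sat_customer (
  customer_hash_key STRING NOT NULL,
  hash_diff         STRING,            -- change-detection hash over the attributes
  c_name            STRING,
  c_address         STRING,
  c_phone           STRING,
  c_mktsegment      STRING,
  load_ts           TIMESTAMP
) USING DELTA;
```
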
  • Why Use it?

    • Scalability: Hubs rarely change, which gives the model stability, while Satellites give it flexibility because new ones can easily be added for existing Hub keys. This makes the model scalable and audit friendly.
    • Parallelism: A high level of parallelism can be achieved because there are few dependencies between the tables of the model. For example, the Hubs and Satellites for customer, product and order can all be loaded in parallel.
    • Canonical: The Raw Vault preserves the source data and metadata, providing a single version of the facts.
    • Adaptability: New Hubs or Satellites can be added to the model incrementally without massive refactoring of the ETL.
    • Databricks Alignment: The Data Vault approach aligns well with the Lakehouse approach.

       


       

  • Design Considerations for Databricks:

    • Key Management: Business keys for Hubs (and hash-diff columns for Satellites) can be generated with hashing functions.
    • Staging Layer: Data Vault needs a landing/staging zone, which acts as the Bronze layer of the medallion architecture.
    • Medallion Architecture for Raw & Business Vault with PIT tables: Hub, Satellite and Link tables can be created in the Silver layer. Applying the business rules on top of them produces the Business Vault, which can serve as the central repository of the data platform.
    • DLT Alignment (a sketch follows this list):
      • Parallel processing for Hub and Satellite tables.
      • Satellite table loads can be configured to capture any change.
      • Sequential dependencies for Links and the Business Vault.
      • Apply DQ checks to build the Business Vault layer.
      • PIT table creation can be orchestrated with materialized views (MVs).
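
A minimal DLT (SQL) sketch of this alignment is shown below. It assumes a hypothetical staging view stg_customer defined elsewhere in the pipeline; all table and column names are illustrative. Because the Hub and Satellite are declared independently, DLT can load them in parallel from the same staging stream.

```sql
-- Hub: business-key hash plus audit columns
CREATE OR REFRESH STREAMING TABLE hub_customer AS
SELECT
  sha1(CAST(c_custkey AS STRING))   AS customer_hash_key,
  c_custkey                         AS customer_id,
  current_timestamp()               AS load_ts,
  'source.customer'                 AS record_source
FROM STREAM(LIVE.stg_customer);

-- Satellite: descriptive attributes, change-detection hash, and a DQ expectation
CREATE OR REFRESH STREAMING TABLE sat_customer (
  CONSTRAINT valid_key EXPECT (customer_hash_key IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT
  sha1(CAST(c_custkey AS STRING))   AS customer_hash_key,
  sha1(concat(UPPER(TRIM(c_name)), UPPER(TRIM(c_address)),
              UPPER(TRIM(c_phone)), UPPER(TRIM(c_mktsegment)))) AS hash_diff,
  c_name, c_address, c_phone, c_mktsegment,
  current_timestamp()               AS load_ts
FROM STREAM(LIVE.stg_customer);
```
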

       

  • Implementation of DV on Databricks:

    • Bronze Layer: The Bronze layer is created from the ingested data. Complex or nested structures may need to be processed/flattened here.
    • Raw Vault: Create the Hub, Satellite and Link tables. This layer becomes the single source of truth for the data platform on Databricks.
      • The key columns for this layer are created by applying a hashing function, for example: sha1(concat(UPPER(TRIM(c_name)), UPPER(TRIM(c_address)), UPPER(TRIM(c_phone)), UPPER(TRIM(c_mktsegment)))) as hash_diff
      • Add load timestamp columns for auditing purposes.
      • DQ rules on top of the Raw Vault tables can be used to populate the Business Vault, using DLT expectations (EXPECT).
    • Business Vault: Apply the business/transformation logic on top of the Raw Vault.
    • Data Mart via PIT: This layer is consumed directly. Incremental refreshes and point-in-time views over the latest data can be built as materialized views (MVs) on top of DLT.
      • These act as the typical denormalized tables of dimensional modelling.
      • If complex join logic is needed for consumption, it can be pre-built and stored in a Delta table.
    • DLT Pipeline Setup (a sketch follows this list):
      • Set up orchestration for parallel/sequential processing.
      • Configure restartability to handle failures.
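
Continuing the earlier sketch, a hypothetical Business Vault table with a DLT expectation and a PIT-style materialized view could look like the following. All names are illustrative assumptions.

```sql
-- Business Vault: DQ expectation applied on top of the Raw Vault satellite
CREATE OR REFRESH MATERIALIZED VIEW bv_customer (
  CONSTRAINT valid_segment EXPECT (c_mktsegment IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT * FROM LIVE.sat_customer;

-- PIT-style materialized view: hub pre-joined with the latest satellite row,
-- serving a denormalized, dimension-like table for consumption
CREATE OR REFRESH MATERIALIZED VIEW pit_customer AS
SELECT h.customer_id, s.c_name, s.c_address, s.c_phone, s.c_mktsegment, s.load_ts
FROM LIVE.hub_customer h
JOIN (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_hash_key ORDER BY load_ts DESC) AS rn
  FROM LIVE.bv_customer
) s
  ON h.customer_hash_key = s.customer_hash_key AND s.rn = 1;
```
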

       

  • Optimization Techniques (example commands follow this list):
    • Use Delta-formatted tables for the Raw Vault, Business Vault and Gold layer tables.
    • Run OPTIMIZE with Z-ORDER on the join keys of Hubs, Links and Satellites.
    • Do not over-partition the tables, especially the smaller Satellite tables. If you need additional indexes beyond Z-order, use Bloom filter indexes on date columns, current-flag columns and other predicate columns that are typically filtered on.
    • Delta Live Tables (materialized views) make creating and managing PIT tables very easy.
    • Reduce optimize.maxFileSize to a lower value, such as 32-64 MB versus the default of 1 GB. Smaller files benefit from file pruning and minimize the I/O needed to retrieve the data you join.
    • A Data Vault model has comparatively more joins, so use a recent DBR version, where Adaptive Query Execution is on by default and the best join strategy is chosen automatically. Use join hints only if necessary, for advanced performance tuning.
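
A few example commands for these techniques are sketched below, again with hypothetical table and column names:

```sql
-- Produce smaller files (~32 MB instead of the 1 GB default) so joins benefit from file pruning
SET spark.databricks.delta.optimize.maxFileSize = 33554432;

-- Compact files and co-locate data on the join keys
OPTIMIZE raw_vault.hub_customer        ZORDER BY (customer_hash_key);
OPTIMIZE raw_vault.sat_customer        ZORDER BY (customer_hash_key);
OPTIMIZE raw_vault.link_customer_order ZORDER BY (customer_hash_key, order_hash_key);

-- Optional additional index on a high-cardinality, frequently filtered column
CREATE BLOOMFILTER INDEX ON TABLE raw_vault.sat_customer
FOR COLUMNS (customer_hash_key OPTIONS (fpp = 0.1, numItems = 50000000));
```
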
  • Tech Layer Overview & Tech Segmentation:

    | Feature     | Raw Vault                        | Business Vault                          |
    |-------------|----------------------------------|-----------------------------------------|
    | Contains    | Hubs, Links, Satellites          | PIT, Bridge, derived views              |
    | Tech        | Delta Live Tables (DLT)          | Views/MVs over DLT                      |
    | Purpose     | Historical, auditable structure  | Business-friendly, performant querying  |
    | Access      | Restricted via Unity Catalog     | Shared to consumers                     |
    | Change Rate | Slowly changing                  | Frequently updated (PITs)               |
    | Storage     | Immutable                        | Derived, read-optimized                 |

    | Layer    | Sub-Layer      | Description                      | Technology          |
    |----------|----------------|----------------------------------|---------------------|
    | Bronze   | Landing Zone   | Raw ingestion                    | Autoloader + DLT    |
    | Silver   | Staging Zone   | Cleansed, deduped data           | DLT                 |
    | Gold     | Raw Vault      | Hubs, Links, Satellites          | DLT                 |
    | Gold     | Business Vault | PIT, Bridge, business rules      | Views/MVs on DLT    |
    | Consumer | Access Layer   | Denormalized, analytics-friendly | Unity Catalog Views |


Data Mesh:

What is Data Mesh?

It's a democratized approach to managing data where various domains operationalize their data, relieving the Central Data/Analytics team from designing and developing data products. Instead, Central teams focus on providing and governing Data resources using a self-service platform.


 

  • Data Mesh Principles:

    • Domain-oriented decentralized data ownership and architecture – so that the ecosystem creating and consuming data can scale out as the number of data sources, the number of use cases, and the diversity of access models to the data increase; simply increase the autonomous nodes on the mesh.
    • Data as a product – so that data users can easily discover, understand and securely use high-quality data with a delightful experience, even when that data is distributed across many domains.
    • Self-serve data infrastructure as a platform – so that the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products.
    • Federated computational governance – so that data users can get value from the aggregation and correlation of independent data products; the mesh behaves as an ecosystem following global interoperability standards, standards that are baked computationally into the platform.

  • Data Mesh Architecture Pattern: It is an architecture pattern where each functional data domain is represented as a node, and the nodes are interconnected, managed, and governed by a centralized IT/governance node. Each data domain can host multiple data products that can be shared across data domains through the same centralized IT/governance node.

     

    • Data Product

      The domain team develops and exposes data products that provide access to the domain’s data in a consistent and consumable way.

      A Data Product facilitates an end goal through the use of data.

      Its objective is to provide this data as a product to the other domain teams in a clean, standardized and consumable way (a small example of marking a table as a data product follows).
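
For illustration, on Databricks a domain team might document and tag a published table in Unity Catalog so that consumers can discover it as a data product. The catalog, schema, table, and tag names below are hypothetical:

```sql
-- Document the table so it is self-describing for consumers
COMMENT ON TABLE sales_domain.gold.orders_daily IS
  'Data product: daily order aggregates owned by the Sales domain';

-- Tag it so it can be discovered and filtered as a data product
ALTER TABLE sales_domain.gold.orders_daily
  SET TAGS ('data_product' = 'orders_daily', 'domain' = 'sales', 'owner' = 'sales-data-team');
```
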


       

  • Build Data Mesh on Databricks: 

    Databricks provides many features, including data ingestion, data transformation, SQL, AI/ML and more, making it a complete, unified data platform. It removes the complexity of juggling multiple tools/services and the interoperability issues between them. This unified nature makes Databricks an ideal platform for implementing a Data Mesh architecture, which demands heterogeneous data types, use cases and data delivery methods. The Data Mesh principles can be mapped onto a Databricks design as follows.

    • Data Domain & Product Platform Building:

      • A common framework can be built and used to onboard the various data domains.
        • This platform can be made configurable for all data processing steps, including ingestion, data cleansing, and transformation/ETL logic at the domain level.
        • A DLT pipeline is a good architectural choice here, as it provides a configurable approach for each data processing step, including data quality rules.
        • Each data product can have its own catalog. DLT is configured with these inputs and can be scaled across all the data products.
      • Data domains can be separated by giving each domain its own workspace. Since a domain can have multiple products, catalog-to-workspace binding is helpful, and access can be controlled at the domain/product level.
      • Data sharing between domains, and with the hub, can be handled by Delta Sharing (see the sketch below).
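
A rough SQL sketch of this setup, with purely hypothetical catalog, share, recipient, and group names, could look like the following (the recipient's sharing identifier comes from the consuming metastore):

```sql
-- One catalog per data product, owned by the domain team
CREATE CATALOG IF NOT EXISTS sales_orders_product;
GRANT USE CATALOG ON CATALOG sales_orders_product TO `sales-domain-team`;

-- Share selected tables with another domain (Databricks-to-Databricks Delta Sharing)
CREATE SHARE IF NOT EXISTS sales_orders_share;
ALTER SHARE sales_orders_share ADD TABLE sales_orders_product.gold.orders_daily;

CREATE RECIPIENT IF NOT EXISTS marketing_domain
  USING ID 'azure:westus:<metastore-uuid>';   -- sharing identifier of the consuming metastore
GRANT SELECT ON SHARE sales_orders_share TO RECIPIENT marketing_domain;
```
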

         

    • Building a Centralized Hub / Self-Service Platform:

      • Data products are published to the data hub, which owns and manages a majority of assets registered in Unity Catalog.

         

    • Federated Governance:

      • Data cataloging, lineage, audit, and access control via Unity Catalog.
      • Unity Catalog provides not only informational cataloging capabilities such as data discovery and lineage, but also the enforcement of fine-grained access controls and auditing.
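
For illustration, the central governance team might grant a consuming group access to a domain's data product and then review privileges via information_schema. All principal and object names here are hypothetical:

```sql
-- Delegate access per consuming group, scoped to catalog, schema and table
GRANT USE CATALOG ON CATALOG sales_orders_product TO `marketing-analysts`;
GRANT USE SCHEMA  ON SCHEMA  sales_orders_product.gold TO `marketing-analysts`;
GRANT SELECT      ON TABLE   sales_orders_product.gold.orders_daily TO `marketing-analysts`;

-- Review granted privileges through the catalog's information_schema
SELECT *
FROM sales_orders_product.information_schema.table_privileges
WHERE grantee = 'marketing-analysts';
```
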

         

  • Self Service Data Layer:

    • As part of the Data Mesh architecture, data can be fetched directly from its domain, since the domain serves it as a data product. Internal/external users, systems and apps can directly consume the data published by the domain.
    • A proper governance model is required to restrict access.
    • Access control can be segregated for other domains and for internal/external users.
    • For external apps and users, it is advisable to follow the industry-recognized 3-tier architecture, isolating the web, API and database tiers into different networks. A microservice-based architecture is recommended for better control and reusability. Microservices for each piece of service functionality, e.g., creating a workspace or a repo, can be written using the Databricks APIs and invoked from the Web Tier.
    • Only the Web Tier can be accessed from the public internet; the API and DB tiers are isolated from the internet.
      • Web Tier: Front end for the self-service portal, accessed from the internet.
      • API Tier: Backbone of the self-service portal; hosts microservices for the different Databricks APIs.
      • Metadata DB: Lean metadata layer that stores functional details of a data domain and its data products, as well as particulars about the accessibility of a data product and its consumers.

  • Central Analytics – Data Denormalization:
    • The hub layer built above holds data from the different domains; this data is mostly dimension/metadata content specific to a single domain.
    • To build a central analytics system, the operational (transactional) data needs to be ingested separately.
    • This layer can be created separately on top of the hub.
    • A separate schema/catalog and workspace can be created for this processing. The data is typically arranged in a denormalized way, using a star/snowflake schema on top of the dimensional model.
      • Creating a separate catalog may be required if this analytical layer serves the enterprise level. A separate pipeline (DLT in batch/triggered mode, or a scheduled PySpark notebook job) serves the purpose here.
      • Creating just a schema is sufficient if the goal is only another denormalized view over the domain data and the ingested transactional data set.
    • A typical SCD Type 2 approach is required for some of the domain data sets; this can be configured in a DLT pipeline (see the sketch below) or built in a PySpark notebook.
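
A minimal DLT (SQL) sketch of SCD Type 2 using APPLY CHANGES, assuming a hypothetical change feed stg_customer_changes with a customer_id key and a change_ts sequencing column:

```sql
-- Declare the SCD Type 2 target
CREATE OR REFRESH STREAMING TABLE dim_customer;

-- Apply incoming changes, keeping full history (Type 2)
APPLY CHANGES INTO LIVE.dim_customer
FROM STREAM(LIVE.stg_customer_changes)
KEYS (customer_id)
SEQUENCE BY change_ts
STORED AS SCD TYPE 2;
```

The resulting dimension table can then feed the star/snowflake schema described above.
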