This section describes data models and how to build them on the Databricks platform, focusing on Data Vault and Data Mesh.
Data Vault
What is Data Vault?
Data Vault is a modern data modeling technique designed for agile, scalable, and auditable enterprise data warehouses. It separates core business concepts, their relationships, and descriptive attributes into distinct components.
It is a "write-optimized" modeling style.

- Key Components:
- Hub – Stores unique business keys (e.g., CustomerID)
- Link – Stores relationships between hubs (e.g., Customer ↔ Order)
- Satellite – Stores descriptive attributes and history (e.g., Customer Name, Status)
- Point-In-Time (PIT) – Pre-joins a hub with its satellites so queries can apply "point in time" WHERE conditions efficiently. Bridge tables similarly pre-join hubs and links to provide flattened, dimension-like views of entities.
- Raw Vault – Source data modeled as Hub, Link, and Satellite tables (see the sketch after this list).
- Business Vault – Created by applying the ETL business rules, data quality rules, and cleansing/conforming rules on top of the Raw Vault. It can serve as an enterprise "central repository" of standardized, cleansed data.
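The sketch below shows one way the three core objects (and their key columns) might be defined as Delta Live Tables in Python. The staged source tables (stg_customer, stg_order), the audit columns, and the hashing convention are illustrative assumptions, not a prescribed layout.

```python
# Minimal Data Vault sketch in DLT Python (runs inside a DLT pipeline).
# Source tables (stg_customer, stg_order) and column names are illustrative.
import dlt
from pyspark.sql import functions as F

def hash_key(col_name):
    """Hash a business key column into a surrogate hub key (assumed convention)."""
    return F.sha1(F.upper(F.trim(F.col(col_name).cast("string"))))

@dlt.table(comment="Hub: unique customer business keys")
def hub_customer():
    return (
        dlt.read("stg_customer")
        .select(
            hash_key("c_custkey").alias("customer_hk"),
            F.col("c_custkey").alias("customer_bk"),   # business key
            "load_ts", "record_source",                # audit columns (assumed present)
        )
        .dropDuplicates(["customer_bk"])
    )

@dlt.table(comment="Link: customer <-> order relationship")
def lnk_customer_order():
    return (
        dlt.read("stg_order")
        .select(
            F.sha1(F.concat_ws("||",
                   F.col("o_custkey").cast("string"),
                   F.col("o_orderkey").cast("string"))).alias("customer_order_hk"),
            hash_key("o_custkey").alias("customer_hk"),
            hash_key("o_orderkey").alias("order_hk"),
            "load_ts", "record_source",
        )
    )

@dlt.table(comment="Satellite: descriptive customer attributes and history")
def sat_customer():
    return (
        dlt.read("stg_customer")
        .select(
            hash_key("c_custkey").alias("customer_hk"),
            "c_name", "c_address", "c_phone", "c_mktsegment",
            F.sha1(F.concat_ws("||",
                   F.upper(F.trim("c_name")), F.upper(F.trim("c_address")),
                   F.upper(F.trim("c_phone")), F.upper(F.trim("c_mktsegment")))).alias("hash_diff"),
            "load_ts", "record_source",
        )
    )
```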
Why Use it?
- Scalability: Hubs are unlikely to change, which gives the model stability, while Satellites add flexibility because new attributes can be attached to existing hub keys. This makes the model scalable and audit friendly.
- Parallelism: A high level of parallelism can be achieved because there are few dependencies between the tables of the model. For example, the hubs and satellites for customer, product, and order can all be loaded in parallel.
- Canonical: The Raw Vault preserves the source data and its metadata, providing a single version of the facts.
- Adaptability: New Hubs or Satellites can be added to the model incrementally without massive refactoring of the ETL.
- Databricks Alignment: The Data Vault approach aligns well with the Lakehouse architecture.


Design Considerations for Databricks:
- Key Management: Hub business keys can be turned into surrogate keys by applying a hashing algorithm (e.g., SHA-1).
- Staging Layer: Data Vault needs a landing/staging zone, which acts as the Bronze layer of the medallion architecture.
- Medallion Architecture for Raw & Business Vault with PIT Tables: Hub, Link, and Satellite tables can be created in the Silver layer. Applying the business rules on top of them produces the Business Vault, which can serve as the central repository of the data platform.
- DLT Alignment:
- Hub and Satellite tables can be processed in parallel.
- Satellite table loads can be configured to capture any change in the source.
- Links and the Business Vault have sequential dependencies on their upstream tables.
- Apply data quality checks (expectations) when building the Business Vault layer (see the sketch after this list).
- PIT table creation can be handled with materialized views.
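As a rough illustration of this DLT alignment, the sketch below derives a Business Vault table from the Raw Vault objects and gates it with DLT expectations; DLT resolves the dependency graph itself, so independent hubs and satellites load in parallel while this table waits for its inputs. Table names and rules are assumptions carried over from the earlier sketch.

```python
# Business Vault table gated by DLT expectations (illustrative rules and names).
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Business Vault: conformed customer view")
@dlt.expect_or_drop("valid_hub_key", "customer_hk IS NOT NULL")   # drop rows failing DQ
@dlt.expect("has_market_segment", "c_mktsegment IS NOT NULL")     # record violations only
def bv_customer():
    hub = dlt.read("hub_customer").select("customer_hk", "customer_bk")
    sat = dlt.read("sat_customer")
    return (
        hub.join(sat, "customer_hk", "left")
           .withColumn("c_name", F.initcap(F.trim("c_name")))     # example cleansing rule
    )
```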

Implementation of Data Vault on Databricks:
- Bronze Layer: Created from the ingested data. Complex or nested structures may need to be processed/flattened here.
- Raw Vault: Create the Hub, Satellite, and Link tables. This layer is the single source of truth for the data platform on Databricks.
- The key columns for this layer are created by applying a hashing function, for example: sha1(concat(UPPER(TRIM(c_name)), UPPER(TRIM(c_address)), UPPER(TRIM(c_phone)), UPPER(TRIM(c_mktsegment)))) AS hash_diff
- Add timestamp columns for auditing purposes.
- Data quality rules can be applied on top of the Raw Vault tables when populating the Business Vault, using DLT expectations (expect).
- Business Vault: Apply the business/transformation logic on top of the Raw Vault.
- Data Mart via PIT: This layer is consumed directly. Incremental refreshes and point-in-time views over the latest data can be built as materialized views on top of DLT (see the sketch after this list).
- This acts as the typical denormalized table of dimensional modeling.
- If complex join logic is needed for consumption, it can be pre-built and stored in a Delta table.
- DLT Pipeline Setup:
- Set up orchestration for parallel/sequential processing.
- Enable restartability to handle failures.
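A possible shape for the PIT/data-mart step is sketched below: a non-streaming DLT table (materialized view) that pre-joins the hub with the latest satellite row per key to produce a denormalized, consumption-ready snapshot. Names are carried over from the earlier sketches and remain assumptions.

```python
# PIT-style consumption table as a DLT materialized view (illustrative names).
import dlt
from pyspark.sql import functions as F
from pyspark.sql.window import Window

@dlt.table(comment="PIT/data mart: current customer snapshot")
def pit_customer_current():
    # Keep only the latest satellite row per hub key.
    latest = Window.partitionBy("customer_hk").orderBy(F.col("load_ts").desc())
    sat_current = (
        dlt.read("sat_customer")
        .withColumn("rn", F.row_number().over(latest))
        .filter("rn = 1")
        .drop("rn")
    )
    hub = dlt.read("hub_customer").select("customer_hk", "customer_bk")
    # Pre-joined, denormalized view for direct consumption.
    return hub.join(sat_current, "customer_hk", "inner")
```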

- Optimization Techniques:
- Use Delta-formatted tables for the Raw Vault, Business Vault, and Gold layer tables.
- Run OPTIMIZE with Z-ordering on the join keys of Hubs, Links, and Satellites (see the sketch after this list).
- Do not over-partition the tables, especially the smaller satellite tables. Use Bloom filter indexes on date columns, current-flag columns, and other predicate columns that are typically filtered on, especially if you need additional indexes beyond Z-ordering.
- Delta Live Tables materialized views make creating and managing PIT tables straightforward.
- Reduce optimize.maxFileSize to a lower value, such as 32-64 MB instead of the default 1 GB. Smaller files benefit from file pruning and minimize the I/O needed to retrieve the data you join.
- A Data Vault model has comparatively more joins, so use the latest DBR version, where Adaptive Query Execution is ON by default and the best join strategy is chosen automatically. Use join hints only if necessary, for advanced performance tuning.
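The maintenance commands below illustrate these points; they assume a Databricks notebook or job session, and the catalog, table, and column names are placeholders.

```python
# Illustrative optimization/maintenance commands (placeholder object names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

# Target smaller files (e.g., 32 MB) so join-heavy queries benefit from file pruning.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", str(32 * 1024 * 1024))

# Compact files and Z-order on the join key of a satellite table.
spark.sql("OPTIMIZE main.raw_vault.sat_customer ZORDER BY (customer_hk)")

# Optional Bloom filter index on a frequently filtered predicate column.
spark.sql("""
    CREATE BLOOMFILTER INDEX ON TABLE main.raw_vault.sat_customer
    FOR COLUMNS (load_ts)
""")
```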
Tech Layer Overview & Tech Segmentation:
| Feature | Raw Vault | Business Vault |
| --- | --- | --- |
| Contains | Hubs, Links, Satellites | PIT, Bridge, Derived views |
| Tech | Delta Live Tables (DLT) | Views/MVs over DLT |
| Purpose | Historical, auditable structure | Business-friendly, performant querying |
| Access | Restricted via Unity Catalog | Shared to consumers |
| Change Rate | Slowly changing | Frequently updated (PITs) |
| Storage | Immutable | Derived, read-optimized |
| Layer | Sub-Layer | Description | Technology |
| --- | --- | --- | --- |
| Bronze | Landing Zone | Raw ingestion | Autoloader + DLT |
| Silver | Staging Zone | Cleansed, deduped data | DLT |
| Gold | Raw Vault | Hubs, Links, Satellites | DLT |
| Gold | Business Vault | PIT, Bridge, business rules | Views/MVs on DLT |
| Consumer | Access Layer | Denormalized, analytics-friendly | Unity Catalog Views |
Data Mesh:
What is Data Mesh?
It's a democratized approach to managing data where various domains operationalize their data, relieving the Central Data/Analytics team from designing and developing data products. Instead, Central teams focus on providing and governing Data resources using a self-service platform.

- Data Mesh Principles:
| Principle | Description |
| --- | --- |
| Domain-oriented decentralized data ownership and architecture | So that the ecosystem creating and consuming data can scale out as the number of sources of data, number of use cases, and diversity of access models to the data increases; simply increase the autonomous nodes on the mesh. |
| Data as a product | So that data users can easily discover, understand and securely use high quality data with a delightful experience; data that is distributed across many domains. |
| Self-serve data infrastructure as a platform | So that the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products. |
| Federated computational governance | So that data users can get value from aggregation and correlation of independent data products - the mesh is behaving as an ecosystem following global interoperability standards; standards that are baked computationally into the platform. |
- Data Mesh Architecture Pattern: An architecture pattern in which each functional data domain is represented as a node; the nodes are interconnected, managed, and governed by a centralized IT/governance node. Each data domain can host multiple data products that can be shared across data domains through the same centralized IT/governance node.

Data Product:
The domain team develops and exposes data products that provide access to the domain’s data in a consistent and consumable way.
A Data Product facilitates an end goal through the use of data.
Its objective is to provide this data to the other domain teams as a clean, standardized product.

- Build Data Mesh on Databricks:
Databricks provides many capabilities, including data ingestion, data transformation, SQL, AI/ML, and more, making it a complete, unified data platform. It removes the complexity of stitching together multiple tools/services and managing the interoperability between them. This unified nature makes Databricks an ideal platform for implementing a Data Mesh architecture, which demands heterogeneous data types, use cases, and data delivery methods. The Data Mesh principles can be aligned to the design on Databricks as follows.
Self-Service Data Layer:
- In a Data Mesh architecture, data can be fetched directly from its domain, which serves it as a data product. Internal and external users, systems, and apps can consume the data published by the domain directly.
- A proper governance model is required to restrict access.
- Access control can be segregated for other domains, internal users, and external users, as sketched below.
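One possible way to express this segregation with Unity Catalog and Delta Sharing is sketched below; the catalog, schema, table, group, and share names are placeholders, and the statements assume a session with sufficient privileges.

```python
# Illustrative Unity Catalog grants and Delta Sharing setup (placeholder names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

# Internal consumers from other domains: grant read access to the published products.
spark.sql("GRANT USE CATALOG ON CATALOG sales_domain TO `analytics-consumers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA sales_domain.data_products TO `analytics-consumers`")
spark.sql("GRANT SELECT ON TABLE sales_domain.data_products.orders_daily TO `analytics-consumers`")

# External consumers: publish the same data product through Delta Sharing instead.
spark.sql("CREATE SHARE IF NOT EXISTS sales_products COMMENT 'Sales domain data products'")
spark.sql("ALTER SHARE sales_products ADD TABLE sales_domain.data_products.orders_daily")
```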

- For external apps and users, it is advisable to follow the industry-recognized three-tier architecture and isolate the front-end, back-end, and database tiers into different networks. A microservice-based architecture is recommended for better control and reusability. Microservices for each piece of functionality, e.g., creating a workspace or a repo, can be written against the Databricks APIs and invoked from the Web Tier.
- Only the Web Tier can be accessed from the public internet; the API and DB tiers are isolated from the internet.
- Web Tier: Front end for the self-service portal, accessed from the internet.
- API Tier: Backbone of the self-service portal; hosts microservices that wrap the different Databricks APIs.
- Metadata DB: A lean metadata layer that stores functional details of data domains and data products, as well as particulars about the accessibility of a data product and its consumers.
- Central Analytics – Data Denormalization:
- The hub layer, built first, holds data from the different domains; this data is mostly dimension/metadata specific to a single domain.
- To build the central analytics system, the operational (transactional) data needs to be ingested separately.
- This analytics layer can be created separately on top of the hub layer.
- A separate schema/catalog and workspace can be created for this processing. The data is typically arranged in a denormalized way, with a star/snowflake schema on top of the dimensional model.
- Creating a separate catalog may be required if this analytical layer serves the enterprise level. Setting up a separate pipeline (DLT in batch mode, or a scheduled job running a PySpark notebook) serves the purpose here.
- Creating just a schema is sufficient if the requirement is only to build another denormalized view over the domain data and the ingested transactional data set.
- The typical approach of building SCD Type 2 is required on a few of the domain data sets; this can be configured in a DLT pipeline or built in a PySpark notebook, as sketched below.
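A minimal sketch of the SCD Type 2 piece in DLT Python is shown below, using apply_changes; the source feed, keys, and sequencing column are assumptions.

```python
# SCD Type 2 dimension maintained by DLT apply_changes (illustrative names).
import dlt
from pyspark.sql import functions as F

# Target streaming table that will hold the SCD Type 2 history.
dlt.create_streaming_table("dim_customer_scd2")

dlt.apply_changes(
    target="dim_customer_scd2",
    source="stg_customer_changes",   # streaming change feed defined elsewhere in the pipeline (assumed)
    keys=["customer_id"],            # business key of the dimension (assumed)
    sequence_by=F.col("change_ts"),  # ordering column for late/out-of-order events (assumed)
    stored_as_scd_type=2,            # keep full history; DLT adds __START_AT/__END_AT columns
)
```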