
Building Data Models on Databricks Platform

rathorer
New Contributor II

This article describes data models and how to build them on the Databricks Platform, focusing on Data Vault and Data Mesh.

Data Vault

What is Data Vault?

Data Vault is a modern data modeling technique designed for agile, scalable, and auditable enterprise data warehouses. It separates core business concepts, their relationships, and descriptive attributes into distinct components.
It is a "write-optimized" modeling style.


  • Key Components (a minimal table sketch follows this list):
    • Hub – Stores unique business keys (e.g., CustomerID)
    • Link – Stores relationships between Hubs (e.g., Customer ↔ Order)
    • Satellite – Stores descriptive attributes and their history (e.g., Customer Name, Status)
    • Point-In-Time (PIT) / Bridge – Pre-joins Hubs with their Satellites (PIT) or several Hubs and Links (Bridge), so that "point in time" filtering and flattened, dimension-like views can be served without repeating the joins at query time
    • Raw Vault – The layer where source data is modeled as Hub, Link and Satellite tables
    • Business Vault – Created by applying the ETL business rules, data quality, cleansing and conforming rules on top of the Raw Vault. It can serve as an enterprise "central repository" of standardized, cleansed data
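
To make this concrete, here is a minimal sketch of a Hub, a Link and a Satellite on Delta. All table and column names (raw_vault.hub_customer, sat_customer, etc.) are illustrative assumptions, not names prescribed by the post or by Databricks.

```sql
-- Hub: one row per unique business key
CREATE TABLE IF NOT EXISTS raw_vault.hub_customer (
  customer_hash_key STRING NOT NULL,   -- e.g. sha1 of the business key
  customer_id       STRING NOT NULL,   -- business key from the source
  load_ts           TIMESTAMP,         -- audit column
  record_source     STRING
) USING DELTA;

-- Link: relationship between two Hubs
CREATE TABLE IF NOT EXISTS raw_vault.link_customer_order (
  link_hash_key     STRING NOT NULL,   -- hash of the combined business keys
  customer_hash_key STRING NOT NULL,
  order_hash_key    STRING NOT NULL,
  load_ts           TIMESTAMP,
  record_source     STRING
) USING DELTA;

-- Satellite: descriptive attributes and their history for a Hub
CREATE TABLE IF NOT EXISTS raw_vault.sat_customer (
  customer_hash_key STRING NOT NULL,
  hash_diff         STRING,            -- change-detection hash over the attributes
  c_name            STRING,
  c_address         STRING,
  c_phone           STRING,
  c_mktsegment      STRING,
  load_ts           TIMESTAMP
) USING DELTA;
```
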
  • Why Use it?

    • Scalability: Hubs rarely change, which gives the model stability, while Satellites give it flexibility because new ones can easily be added for existing Hub keys. This makes the model scalable and audit friendly.
    • Parallelism: A high level of parallelism can be achieved because there are few dependencies between the tables of the model. For example, the Hubs and Satellites for customer, product and order can all be loaded in parallel.
    • Canonical: The Raw Vault preserves the source data and metadata, providing a single version of the facts.
    • Adaptability: New Hubs or Satellites can be added to the model incrementally without massive refactoring of the ETL.
    • Databricks Alignment: The Data Vault approach aligns well with the Lakehouse approach.

       


       

  • Design Considerations for Databricks:

    • Key Management: Business keys for Hubs (and hash-diff columns for Satellites) can be generated with hashing functions.
    • Staging Layer: Data Vault needs a landing/staging zone, which acts as the Bronze layer of the medallion architecture.
    • Medallion Architecture for Raw & Business Vault with PIT tables: Hub, Satellite and Link tables can be created in the Silver layer. Applying the business rules on top of them produces the Business Vault, which can serve as the central repository of the data platform.
    • DLT Alignment (a sketch follows this list):
      • Parallel processing for Hub and Satellite tables.
      • Satellite table loads can be configured to capture any change.
      • Sequential dependencies for Links and the Business Vault.
      • Apply DQ checks to build the Business Vault layer.
      • PIT table creation can be orchestrated with materialized views (MVs).
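
A minimal DLT (SQL) sketch of this alignment is shown below. It assumes a hypothetical staging view stg_customer defined elsewhere in the pipeline; all table and column names are illustrative. Because the Hub and Satellite are declared independently, DLT can load them in parallel from the same staging stream.

```sql
-- Hub: business-key hash plus audit columns
CREATE OR REFRESH STREAMING TABLE hub_customer AS
SELECT
  sha1(CAST(c_custkey AS STRING))   AS customer_hash_key,
  c_custkey                         AS customer_id,
  current_timestamp()               AS load_ts,
  'source.customer'                 AS record_source
FROM STREAM(LIVE.stg_customer);

-- Satellite: descriptive attributes, change-detection hash, and a DQ expectation
CREATE OR REFRESH STREAMING TABLE sat_customer (
  CONSTRAINT valid_key EXPECT (customer_hash_key IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT
  sha1(CAST(c_custkey AS STRING))   AS customer_hash_key,
  sha1(concat(UPPER(TRIM(c_name)), UPPER(TRIM(c_address)),
              UPPER(TRIM(c_phone)), UPPER(TRIM(c_mktsegment)))) AS hash_diff,
  c_name, c_address, c_phone, c_mktsegment,
  current_timestamp()               AS load_ts
FROM STREAM(LIVE.stg_customer);
```
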

       

  • Implementation of DV on Databricks:

    • Bronze Layer: The Bronze layer is created from the ingested data. Complex or nested structures may need to be processed/flattened here.
    • Raw Vault: Create the Hub, Satellite and Link tables. This layer becomes the single source of truth for the data platform on Databricks.
      • The key columns for this layer are created by applying a hashing function, for example: sha1(concat(UPPER(TRIM(c_name)), UPPER(TRIM(c_address)), UPPER(TRIM(c_phone)), UPPER(TRIM(c_mktsegment)))) as hash_diff
      • Add load timestamp columns for auditing purposes.
      • DQ rules on top of the Raw Vault tables can be used to populate the Business Vault, using DLT expectations (EXPECT).
    • Business Vault: Apply the business/transformation logic on top of the Raw Vault.
    • Data Mart via PIT: This layer is consumed directly. Incremental refreshes and point-in-time views over the latest data can be built as materialized views (MVs) on top of DLT.
      • These act as the typical denormalized tables of dimensional modelling.
      • If complex join logic is needed for consumption, it can be pre-built and stored in a Delta table.
    • DLT Pipeline Setup (a sketch follows this list):
      • Set up orchestration for parallel/sequential processing.
      • Configure restartability to handle failures.
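
Continuing the earlier sketch, a hypothetical Business Vault table with a DLT expectation and a PIT-style materialized view could look like the following. All names are illustrative assumptions.

```sql
-- Business Vault: DQ expectation applied on top of the Raw Vault satellite
CREATE OR REFRESH MATERIALIZED VIEW bv_customer (
  CONSTRAINT valid_segment EXPECT (c_mktsegment IS NOT NULL) ON VIOLATION DROP ROW
) AS
SELECT * FROM LIVE.sat_customer;

-- PIT-style materialized view: hub pre-joined with the latest satellite row,
-- serving a denormalized, dimension-like table for consumption
CREATE OR REFRESH MATERIALIZED VIEW pit_customer AS
SELECT h.customer_id, s.c_name, s.c_address, s.c_phone, s.c_mktsegment, s.load_ts
FROM LIVE.hub_customer h
JOIN (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_hash_key ORDER BY load_ts DESC) AS rn
  FROM LIVE.bv_customer
) s
  ON h.customer_hash_key = s.customer_hash_key AND s.rn = 1;
```
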

       

  • Optimization Techniques (example commands follow this list):
    • Use Delta-formatted tables for the Raw Vault, Business Vault and Gold layer tables.
    • Run OPTIMIZE with Z-ORDER on the join keys of Hubs, Links and Satellites.
    • Do not over-partition the tables, especially the smaller Satellite tables. If you need additional indexes beyond Z-order, use Bloom filter indexes on date columns, current-flag columns and other predicate columns that are typically filtered on.
    • Delta Live Tables (materialized views) make creating and managing PIT tables very easy.
    • Reduce optimize.maxFileSize to a lower value, such as 32-64 MB versus the default of 1 GB. Smaller files benefit from file pruning and minimize the I/O needed to retrieve the data you join.
    • A Data Vault model has comparatively more joins, so use a recent DBR version, where Adaptive Query Execution is on by default and the best join strategy is chosen automatically. Use join hints only if necessary, for advanced performance tuning.
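
A few example commands for these techniques are sketched below, again with hypothetical table and column names:

```sql
-- Produce smaller files (~32 MB instead of the 1 GB default) so joins benefit from file pruning
SET spark.databricks.delta.optimize.maxFileSize = 33554432;

-- Compact files and co-locate data on the join keys
OPTIMIZE raw_vault.hub_customer        ZORDER BY (customer_hash_key);
OPTIMIZE raw_vault.sat_customer        ZORDER BY (customer_hash_key);
OPTIMIZE raw_vault.link_customer_order ZORDER BY (customer_hash_key, order_hash_key);

-- Optional additional index on a high-cardinality, frequently filtered column
CREATE BLOOMFILTER INDEX ON TABLE raw_vault.sat_customer
FOR COLUMNS (customer_hash_key OPTIONS (fpp = 0.1, numItems = 50000000));
```
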
  • Tech Layer Overview & Tech Segmentation:

    | Feature     | Raw Vault                        | Business Vault                          |
    |-------------|----------------------------------|-----------------------------------------|
    | Contains    | Hubs, Links, Satellites          | PIT, Bridge, derived views              |
    | Tech        | Delta Live Tables (DLT)          | Views/MVs over DLT                      |
    | Purpose     | Historical, auditable structure  | Business-friendly, performant querying  |
    | Access      | Restricted via Unity Catalog     | Shared to consumers                     |
    | Change Rate | Slowly changing                  | Frequently updated (PITs)               |
    | Storage     | Immutable                        | Derived, read-optimized                 |

    | Layer    | Sub-Layer      | Description                      | Technology          |
    |----------|----------------|----------------------------------|---------------------|
    | Bronze   | Landing Zone   | Raw ingestion                    | Autoloader + DLT    |
    | Silver   | Staging Zone   | Cleansed, deduped data           | DLT                 |
    | Gold     | Raw Vault      | Hubs, Links, Satellites          | DLT                 |
    | Gold     | Business Vault | PIT, Bridge, business rules      | Views/MVs on DLT    |
    | Consumer | Access Layer   | Denormalized, analytics-friendly | Unity Catalog Views |


Data Mesh:

What is Data Mesh?

It's a democratized approach to managing data where various domains operationalize their data, relieving the Central Data/Analytics team from designing and developing data products. Instead, Central teams focus on providing and governing Data resources using a self-service platform.


 

  • Data Mesh Principles:

    • Domain-oriented decentralized data ownership and architecture – so that the ecosystem creating and consuming data can scale out as the number of data sources, the number of use cases, and the diversity of access models to the data increase; simply increase the autonomous nodes on the mesh.
    • Data as a product – so that data users can easily discover, understand and securely use high-quality data with a delightful experience, even when that data is distributed across many domains.
    • Self-serve data infrastructure as a platform – so that the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products.
    • Federated computational governance – so that data users can get value from the aggregation and correlation of independent data products; the mesh behaves as an ecosystem following global interoperability standards, standards that are baked computationally into the platform.

  • Data Mesh Architecture Pattern: It is an architecture pattern where each functional data domain is represented as a node, and the nodes are interconnected, managed, and governed by a centralized IT/governance node. Each data domain can host multiple data products that can be shared across data domains through the same centralized IT/governance node.

     

    • Data Product

      The domain team develops and exposes data products that provide access to the domain’s data in a consistent and consumable way.

      A Data Product facilitates an end goal through the use of data.

      Its objective is to provide this data as a product to the other domain teams in a clean, standardized and consumable way (a small example of marking a table as a data product follows).
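
For illustration, on Databricks a domain team might document and tag a published table in Unity Catalog so that consumers can discover it as a data product. The catalog, schema, table, and tag names below are hypothetical:

```sql
-- Document the table so it is self-describing for consumers
COMMENT ON TABLE sales_domain.gold.orders_daily IS
  'Data product: daily order aggregates owned by the Sales domain';

-- Tag it so it can be discovered and filtered as a data product
ALTER TABLE sales_domain.gold.orders_daily
  SET TAGS ('data_product' = 'orders_daily', 'domain' = 'sales', 'owner' = 'sales-data-team');
```
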


       

  • Build Data Mesh on Databricks: 

    Databricks provides many features, including data ingestion, data transformation, SQL, AI/ML and more, making it a complete, unified data platform. It removes the complexity of juggling multiple tools/services and the interoperability issues between them. This unified nature makes Databricks an ideal platform for implementing a Data Mesh architecture, which demands heterogeneous data types, use cases and data delivery methods. The Data Mesh principles can be mapped onto a Databricks design as follows.

    • Data Domain & Product Platform Building:

      • A common framework can be built and used to onboard the various data domains.
        • This platform can be made configurable for all data processing steps, including ingestion, data cleansing, and transformation/ETL logic at the domain level.
        • A DLT pipeline is a good architectural choice here, as it provides a configurable approach for each data processing step, including data quality rules.
        • Each data product can have its own catalog. DLT is configured with these inputs and can be scaled across all the data products.
      • Data domains can be separated by giving each domain its own workspace. Since a domain can have multiple products, catalog-to-workspace binding is helpful, and access can be controlled at the domain/product level.
      • Data sharing between domains, and with the hub, can be handled by Delta Sharing (see the sketch below).
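
A rough SQL sketch of this setup, with purely hypothetical catalog, share, recipient, and group names, could look like the following (the recipient's sharing identifier comes from the consuming metastore):

```sql
-- One catalog per data product, owned by the domain team
CREATE CATALOG IF NOT EXISTS sales_orders_product;
GRANT USE CATALOG ON CATALOG sales_orders_product TO `sales-domain-team`;

-- Share selected tables with another domain (Databricks-to-Databricks Delta Sharing)
CREATE SHARE IF NOT EXISTS sales_orders_share;
ALTER SHARE sales_orders_share ADD TABLE sales_orders_product.gold.orders_daily;

CREATE RECIPIENT IF NOT EXISTS marketing_domain
  USING ID 'azure:westus:<metastore-uuid>';   -- sharing identifier of the consuming metastore
GRANT SELECT ON SHARE sales_orders_share TO RECIPIENT marketing_domain;
```
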

         

    • Building a Centralized Hub / Self-Service Platform:

      • Data products are published to the data hub, which owns and manages a majority of assets registered in Unity Catalog.

         

    • Federated Governance:

      • Data cataloging, lineage, audit, and access control via Unity Catalog.
      • Unity Catalog provides not only informational cataloging capabilities such as data discovery and lineage, but also the enforcement of fine-grained access controls and auditing.
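
For illustration, the central governance team might grant a consuming group access to a domain's data product and then review privileges via information_schema. All principal and object names here are hypothetical:

```sql
-- Delegate access per consuming group, scoped to catalog, schema and table
GRANT USE CATALOG ON CATALOG sales_orders_product TO `marketing-analysts`;
GRANT USE SCHEMA  ON SCHEMA  sales_orders_product.gold TO `marketing-analysts`;
GRANT SELECT      ON TABLE   sales_orders_product.gold.orders_daily TO `marketing-analysts`;

-- Review granted privileges through the catalog's information_schema
SELECT *
FROM sales_orders_product.information_schema.table_privileges
WHERE grantee = 'marketing-analysts';
```
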

         

  • Self Service Data Layer:

    • As part of the Data Mesh architecture, data can be fetched directly from its domain, since the domain serves it as a data product. Internal/external users, systems and apps can directly consume the data published by the domain.
    • A proper governance model is required to restrict access.
    • Access control can be segregated for other domains and for internal/external users.
    • For external apps and users, it is advisable to follow the industry-recognized 3-tier architecture, isolating the web, API and database tiers into different networks. A microservice-based architecture is recommended for better control and reusability. Microservices for each piece of service functionality, e.g., creating a workspace or a repo, can be written using the Databricks APIs and invoked from the Web Tier.
    • Only the Web Tier can be accessed from the public internet; the API and DB tiers are isolated from the internet.
      • Web Tier: Front end for the self-service portal, accessed from the internet.
      • API Tier: Backbone of the self-service portal; hosts microservices for the different Databricks APIs.
      • Metadata DB: Lean metadata layer that stores functional details of a data domain and its data products, as well as particulars about the accessibility of a data product and its consumers.

  • Central Analytics – Data Denormalization:
    • The hub layer built above holds data from the different domains; this data is mostly dimension/metadata content specific to a single domain.
    • To build a central analytics system, the operational (transactional) data needs to be ingested separately.
    • This layer can be created separately on top of the hub.
    • A separate schema/catalog and workspace can be created for this processing. The data is typically arranged in a denormalized way, using a star/snowflake schema on top of the dimensional model.
      • Creating a separate catalog may be required if this analytical layer serves the enterprise level. A separate pipeline (DLT in batch/triggered mode, or a scheduled PySpark notebook job) serves the purpose here.
      • Creating just a schema is sufficient if the goal is only another denormalized view over the domain data and the ingested transactional data set.
    • A typical SCD Type 2 approach is required for some of the domain data sets; this can be configured in a DLT pipeline (see the sketch below) or built in a PySpark notebook.
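
A minimal DLT (SQL) sketch of SCD Type 2 using APPLY CHANGES, assuming a hypothetical change feed stg_customer_changes with a customer_id key and a change_ts sequencing column:

```sql
-- Declare the SCD Type 2 target
CREATE OR REFRESH STREAMING TABLE dim_customer;

-- Apply incoming changes, keeping full history (Type 2)
APPLY CHANGES INTO LIVE.dim_customer
FROM STREAM(LIVE.stg_customer_changes)
KEYS (customer_id)
SEQUENCE BY change_ts
STORED AS SCD TYPE 2;
```

The resulting dimension table can then feed the star/snowflake schema described above.
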