<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Building Data Models on Databricks Platform in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/building-data-models-on-databricks-platform/m-p/126229#M2169</link>
    <description>&lt;P&gt;This post describes data models and how to build them on the Databricks Platform, focusing on Data Vault and Data Mesh.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Vault&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What is Data Vault?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Data Vault is a modern data modeling technique designed for agile, scalable, and auditable enterprise data warehouses. It separates core business concepts, their relationships, and descriptive attributes into distinct components.&lt;BR /&gt;It is a "write-optimized" modeling style.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753270638873.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18401i03FCECFC2CA18894/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_0-1753270638873.png" alt="rathorer_0-1753270638873.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Key Components:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Hub&lt;/STRONG&gt; – Stores unique business keys (e.g., CustomerID)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Link&lt;/STRONG&gt; – Stores relationships between hubs (e.g., Customer ↔ Order)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Satellite&lt;/STRONG&gt; – Stores descriptive attributes and history (e.g., Customer Name, Status)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Point-In-Time (PIT)&lt;/STRONG&gt; – Pre-joins hubs with their satellites so queries can apply "point in time" WHERE filters efficiently; Bridge tables similarly pre-join hubs and links to provide flattened, dimension-like views of entities.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Raw Vault&lt;/STRONG&gt; – Data is modeled as Hub, Link, and Satellite tables in the Raw Data Vault&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business Vault&lt;/STRONG&gt; – Created by applying the ETL business rules, data quality rules, and cleansing and conforming rules. 
It can serve as an enterprise "central repository" of standardized, cleansed data.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why Use It?&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Scalability&lt;/STRONG&gt;: Hubs rarely change, which adds stability, while Satellites add flexibility because they can easily be extended for existing Hub keys; this makes the model scalable and audit-friendly.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Parallelism&lt;/STRONG&gt;: A high level of parallelism can be achieved because there are few dependencies between the tables of the model. For example, the hubs and satellites for customer, product, and order could all be loaded in parallel.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Canonical&lt;/STRONG&gt;: The Raw Vault preserves the source metadata, maintaining a single version of the facts.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Adaptability&lt;/STRONG&gt;: New Hubs or Satellites can be added to the model incrementally without massive ETL refactoring.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Alignment&lt;/STRONG&gt;: The Data Vault approach aligns well with the Lakehouse architecture.&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_15-1753272761286.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18419i09A6349F3D2C29E5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_15-1753272761286.png" alt="rathorer_15-1753272761286.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_16-1753272771387.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18420i1B5246F50229C516/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_16-1753272771387.png" alt="rathorer_16-1753272771387.png" 
/&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Design Considerations for Databricks:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Key Management: &lt;/STRONG&gt;Business keys for Hubs can be generated with hashing algorithms.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Staging Layer&lt;/STRONG&gt;: Data Vault requires a landing/staging zone, which acts as the Bronze layer of the medallion architecture.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Building Medallion Architecture for Raw &amp;amp; Business Vault with PIT Tables&lt;/STRONG&gt;: Hub, Satellite, and Link tables can be created in the Silver layer. Applying the business rules on top of them creates the Business Vault, which can serve as the central repository of the data platform.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DLT Alignment&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;Parallel processing for Hub and Satellite tables.&lt;/LI&gt;&lt;LI&gt;Satellite loads can be configured to capture any change.&lt;/LI&gt;&lt;LI&gt;Sequential dependencies for Links and the Business Vault.&lt;/LI&gt;&lt;LI&gt;Apply DQ checks to build the Business Vault layer.&lt;/LI&gt;&lt;LI&gt;PIT table creation can be orchestrated with materialized views.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_18-1753272935114.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18422i6318AB7B03349FAD/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_18-1753272935114.png" alt="rathorer_18-1753272935114.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Implementation of Data Vault on Databricks:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Bronze Layer: &lt;/STRONG&gt;The Bronze layer is created from ingested data. 
Complex structured data may need to be processed or flattened.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Raw Vault&lt;/STRONG&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;Create the Hub, Satellite, and Link tables here. This layer is the single source of truth for the data platform on Databricks.&lt;UL&gt;&lt;LI&gt;The key columns for this layer are created by applying a hashing function, for example: sha1(concat(UPPER(TRIM(c_name)),UPPER(TRIM(c_address)),UPPER(TRIM(c_phone)),UPPER(TRIM(c_mktsegment)))) as hash_diff&lt;/LI&gt;&lt;LI&gt;Add timestamp columns for auditing purposes.&lt;/LI&gt;&lt;LI&gt;DQ rules can be added on top of the Raw Vault tables, using DLT expectations, to populate the Business Vault.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business Vault&lt;/STRONG&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;Apply the business/transformation logic on top of the Raw Vault.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Mart by PIT&lt;/STRONG&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;This layer is consumed directly; refreshing it with incremental data sets, or building point-in-time views over the latest data, can be done by building materialized views on top of DLT.&lt;UL&gt;&lt;LI&gt;This acts as the typical denormalized table of dimensional modeling.&lt;/LI&gt;&lt;LI&gt;Complex join logic needed for consumption can be pre-built and stored in Delta tables.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DLT Pipeline Setup&lt;/STRONG&gt;&lt;STRONG&gt;:&amp;nbsp;&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Set up orchestration for parallel/sequential processing.&lt;/LI&gt;&lt;LI&gt;Configure restartability to handle failures.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753273211716.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18423iADB0A9EEAC7C1879/image-size/medium?v=v2&amp;amp;px=400" role="button" 
title="rathorer_0-1753273211716.png" alt="rathorer_0-1753273211716.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Optimization Techniques:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Use Delta-formatted tables for the Raw Vault, Business Vault, and Gold layer tables.&lt;/LI&gt;&lt;LI&gt;Run OPTIMIZE with Z-ordering on all join keys of Hubs, Links, and Satellites.&lt;/LI&gt;&lt;LI&gt;Do not over-partition the tables, especially the smaller satellite tables. Use Bloom filter indexes on date columns, current-flag columns, and predicate columns that are typically filtered on, especially if you need additional indexes beyond Z-ordering.&lt;/LI&gt;&lt;LI&gt;Delta Live Tables (materialized views) make creating and managing PIT tables very easy.&lt;/LI&gt;&lt;LI&gt;Reduce optimize.maxFileSize to a lower value, such as 32–64 MB versus the default of 1 GB. Smaller files benefit from file pruning and minimize the I/O needed to retrieve the data you join.&lt;/LI&gt;&lt;LI&gt;A Data Vault model has comparatively more joins, so use a recent DBR version, which has Adaptive Query Execution ON by default so that the best join strategy is chosen automatically. Use join hints only if necessary. 
( for advanced performance tuning).&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Tech Layer Overview &amp;amp; Tech Segmentation:&lt;/STRONG&gt;&lt;/P&gt;&lt;TABLE width="985px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;&lt;STRONG&gt;Feature&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;&lt;STRONG&gt;Raw Vault&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;&lt;STRONG&gt;Business Vault&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Contains&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Hubs, Links, Satellites&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;PIT, Bridge, Derived views&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Tech&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Delta Live Tables (DLT)&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Views/MVs over DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Purpose&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Historical, auditable structure&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Business-friendly, performant querying&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Access&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Restricted via Unity Catalog&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Shared to consumers&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Change Rate&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Slowly changing&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Frequently updated (PITs)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Storage&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Immutable&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Derived, 
read-optimized&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;TABLE width="1230px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;&lt;STRONG&gt;Layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;&lt;STRONG&gt;Sub-Layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;&lt;STRONG&gt;Technology&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Bronze&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Landing Zone&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Raw ingestion&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;Autoloader + DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Silver&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Staging Zone&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Cleansed, deduped data&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Gold&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Raw Vault&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Hubs, Links, Satellites&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Business Vault&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;PIT, Bridge, business rules&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;Views/MVs on DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Consumer&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Access Layer&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Denormalized, analytics-friendly&lt;/P&gt;&lt;/TD&gt;&lt;TD 
width="217.927px"&gt;&lt;P&gt;Unity Catalog Views&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Storage&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Immutable&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Derived, read-optimized&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Data Mesh:&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What is Data Mesh:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;It's a democratized approach to managing data where various domains operationalize their data, relieving the Central Data/Analytics team from designing and developing data products. Instead, Central teams focus on providing and governing Data resources using a self-service platform.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_1-1753273720055.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18424i3ADC92475B6C0BDA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_1-1753273720055.png" alt="rathorer_1-1753273720055.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Data Mesh Principles:&lt;/STRONG&gt; &lt;TABLE width="1822px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;&lt;STRONG&gt;Principle&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Domain-oriented decentralized data ownership and architecture&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;the ecosystem creating and consuming data can scale out as the number of sources of data, number of use cases, and diversity of access models to the data 
increases; simply increase the autonomous nodes on the mesh.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Data as a product&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;data users can easily discover, understand, and securely use high-quality data with a delightful experience; data that is distributed across many domains.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Self-serve data infrastructure as a platform&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing, and maintaining secure and interoperable data products.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Federated computational governance&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;data users can get value from the aggregation and correlation of independent data products; the mesh behaves as an ecosystem following global interoperability standards that are baked computationally into the platform.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Mesh Architecture Pattern:&amp;nbsp;&lt;/STRONG&gt;It is an architecture pattern in which each functional data domain is represented as a node; the nodes are interconnected, managed, and governed by a centralized IT/governance node. 
Each data domain can host multiple data products that can be shared across different data domains through the same centralized IT/governance node.&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753283798836.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18436i8103CE3398EDA1AC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_0-1753283798836.png" alt="rathorer_0-1753283798836.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Data Product&lt;/STRONG&gt;:&amp;nbsp;&lt;/P&gt;&lt;P&gt;The domain team develops and exposes data products that provide access to the domain’s data in a consistent and consumable way.&lt;/P&gt;&lt;P&gt;A data product facilitates an end goal through the use of data.&lt;/P&gt;&lt;P&gt;Its objective is to provide this data in a clean, standardized, and proper way as a product to the other domain teams.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_1-1753283833115.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18437iF158A810D4BB9E70/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_1-1753283833115.png" alt="rathorer_1-1753283833115.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Build Data Mesh on Databricks:&amp;nbsp;&lt;/STRONG&gt;&lt;P&gt;Databricks provides many features, including data ingestion, data transformation, SQL, AI/ML, and more, making it a complete, unified data platform. It takes away the complexity of juggling multiple tools/services and the interoperability between them. 
This unified nature makes Databricks an ideal platform for implementing a Data Mesh architecture, which demands heterogeneous data types, use cases, and data delivery methods. Data Mesh principles map naturally onto a Databricks design.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Data Domain&lt;/STRONG&gt; &lt;STRONG&gt;&amp;amp;&lt;/STRONG&gt; &lt;STRONG&gt;Product&lt;/STRONG&gt; &lt;STRONG&gt;Platform&lt;/STRONG&gt; &lt;STRONG&gt;Building&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A common framework can be built and used to onboard the various data domains.&lt;UL&gt;&lt;LI&gt;This platform can be made configurable for all data processing steps, including ingestion, data cleansing, and applying transformation/ETL logic at the domain level.&lt;/LI&gt;&lt;LI&gt;A DLT pipeline is a good architectural choice here, as it provides a configurable approach for each step of data processing, including setting up DQ rules.&lt;/LI&gt;&lt;LI&gt;Each data product can have its own catalog. DLT can be configured from these inputs and scaled across all the data products.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Separate data domains can be isolated by using separate workspaces. 
Since a domain can have multiple products, catalog-to-workspace binding is helpful, and access can be controlled at the domain/product level.&lt;/LI&gt;&lt;LI&gt;Data sharing between domains, and with the hub, can be controlled by Delta Sharing.&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_2-1753284068176.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18439i719036D029D7E393/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_2-1753284068176.png" alt="rathorer_2-1753284068176.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Building a Centralized Hub / Self-Service Platform&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Data products are published to the data hub, which owns and manages the majority of assets registered in Unity Catalog.&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_3-1753284162637.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18440i8F882A9998149EF5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_3-1753284162637.png" alt="rathorer_3-1753284162637.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Federated Governance&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Data cataloging, lineage, audit, and access control via Unity Catalog.&lt;/LI&gt;&lt;LI&gt;Unity Catalog provides not only&amp;nbsp;&lt;I&gt;informational&lt;/I&gt;&amp;nbsp;cataloging capabilities such as data discovery and lineage, but also the&amp;nbsp;&lt;I&gt;enforcement&lt;/I&gt;&amp;nbsp;of fine-grained access controls and auditing.&lt;BR /&gt;&lt;span 
class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_4-1753284208418.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18441i8E29BEC2460FF999/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_4-1753284208418.png" alt="rathorer_4-1753284208418.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Self-Service Data Layer:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;In a Data Mesh architecture, data can be fetched directly from its domain, since the domain serves it as a data product. Internal/external users, systems, and apps can directly fetch the data published by the domain itself.&lt;/LI&gt;&lt;LI&gt;A proper governance model is required to restrict access.&lt;/LI&gt;&lt;LI&gt;Access control can be segregated for other domains and for internal/external users.&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753284389056.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18442i978A9C700BB72DD9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_0-1753284389056.png" alt="rathorer_0-1753284389056.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;LI&gt;As shown, for external apps and users it is advisable to follow the industry-recognized 3-tier architecture, isolating the front-end, back-end, and database tiers into different networks. A microservice-based architecture is recommended for better control and reusability. All microservices for a given piece of functionality, e.g., creating a workspace or a repo, can be written using Databricks APIs and leveraged from the Web Tier.&lt;/LI&gt;&lt;LI&gt;Only the Web Tier can be accessed from the public internet. 
The API and DB Tiers are isolated from the internet.&lt;UL&gt;&lt;LI&gt;Web Tier: Front end for the self-service portal, accessed from the internet.&lt;/LI&gt;&lt;LI&gt;API Tier: Backbone of the self-service portal. Hosts microservices for the different Databricks APIs.&lt;/LI&gt;&lt;LI&gt;Metadata DB: Lean metadata layer storing functional details of each data domain and its data products. It also stores particulars about data product accessibility and consumers.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Central Analytics – Data Denormalization:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;The hub layer, once built, holds data from the different domains; this data is mostly dimension/metadata specific to a single domain.&lt;/LI&gt;&lt;LI&gt;Building a central analytics system requires ingesting the operational data separately.&lt;/LI&gt;&lt;LI&gt;This layer can be created separately on top of the hub.&lt;/LI&gt;&lt;LI&gt;A separate schema/catalog and workspace can be created for this processing. The data is typically arranged in a denormalized way, with a Star/Snowflake schema on top of the dimensional model.&lt;UL&gt;&lt;LI&gt;Creating a separate catalog may be required if this analytical layer serves the enterprise level; setting up a separate pipeline (DLT in batch mode, or a scheduled PySpark notebook job) serves the purpose here.&lt;/LI&gt;&lt;LI&gt;Creating just a schema is sufficient if the goal is only another denormalized view over the domain data and the ingested transactional data set.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;It also requires the typical approach of building SCD Type 2 on a few domain data sets, which can be configured in a DLT pipeline or built in a PySpark notebook.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
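The hash-key approach described for the Raw Vault can be sketched in plain Python (no Spark required); the `hash_diff` helper and sample column names mirror the article's sha1(concat(UPPER(TRIM(...)))) expression and are illustrative only.

```python
import hashlib

def hash_diff(*attrs):
    # Data Vault-style hash: sha1 over trimmed, upper-cased attribute values,
    # mirroring sha1(concat(UPPER(TRIM(col)), ...)) from the article.
    normalized = "".join(str(a).strip().upper() for a in attrs)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# A sample customer row using the article's TPC-H-style column names
row = {"c_name": " Acme Corp ", "c_address": "1 Main St",
       "c_phone": "555-0100", "c_mktsegment": "RETAIL"}

# Business-key hash for the Hub, and a hash-diff over the satellite attributes
hub_key = hash_diff(row["c_name"])
sat_diff = hash_diff(row["c_name"], row["c_address"],
                     row["c_phone"], row["c_mktsegment"])
```

Normalizing (trim + upper-case) before hashing keeps keys stable across cosmetic differences in the source data.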
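The "point in time" filtering that PIT tables pre-compute can be illustrated with a minimal pure-Python sketch; the satellite history and the `point_in_time` helper are hypothetical, not Databricks APIs.

```python
from datetime import date

# Hypothetical satellite history for one hub key: (load_date, attributes)
sat_customer = [
    (date(2024, 1, 1), {"status": "NEW"}),
    (date(2024, 3, 1), {"status": "ACTIVE"}),
    (date(2024, 6, 1), {"status": "CHURNED"}),
]

def point_in_time(history, as_of):
    # Return the satellite record effective at `as_of`: the latest row loaded
    # on or before that date, or None if the key did not exist yet.
    effective = None
    for load_date, record in sorted(history, key=lambda entry: entry[0]):
        if load_date > as_of:
            break  # history is sorted, so later rows cannot qualify either
        effective = record
    return effective
```

A PIT table materializes exactly this lookup for every hub key and snapshot date, so consumers avoid repeating the scan-and-filter at query time.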
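The SCD Type 2 approach mentioned for the central analytics layer boils down to closing the current record and appending a new version. A minimal pure-Python sketch follows; the `scd2_upsert` helper and sentinel end date are illustrative, standing in for what a DLT pipeline or MERGE statement would do.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel end date marking the current record

def scd2_upsert(history, key, attrs, effective):
    # Close the current record for `key` if its attributes changed,
    # then append a new current record (SCD Type 2).
    current = [r for r in history if r["key"] == key and r["end"] == OPEN_END]
    if current:
        if current[0]["attrs"] == attrs:
            return history  # no change, keep the current record open
        current[0]["end"] = effective  # close the old version
    history.append({"key": key, "attrs": attrs,
                    "start": effective, "end": OPEN_END})
    return history

hist = []
scd2_upsert(hist, "C1", {"segment": "RETAIL"}, date(2024, 1, 1))
scd2_upsert(hist, "C1", {"segment": "CORPORATE"}, date(2024, 7, 1))
```

After the second call, the RETAIL version is closed as of 2024-07-01 and the CORPORATE version is the open, current record.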
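The file-size and AQE tuning advice in the optimization section corresponds to standard Spark/Delta session settings; a sketch of the relevant configuration values, which on a real cluster would be applied via spark.conf.set (the dict name is illustrative).

```python
# Session settings matching the article's join-tuning guidance.
vault_join_tuning = {
    # Smaller target files (vs. the ~1 GB default) improve file pruning
    # and reduce I/O on the join-heavy Data Vault queries.
    "spark.databricks.delta.optimize.maxFileSize": str(64 * 1024 * 1024),
    # Adaptive Query Execution picks the best join strategy automatically.
    "spark.sql.adaptive.enabled": "true",
}
```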
    <pubDate>Wed, 23 Jul 2025 15:34:18 GMT</pubDate>
    <dc:creator>rathorer</dc:creator>
    <dc:date>2025-07-23T15:34:18Z</dc:date>
    <item>
      <title>Building Data Models on Databricks Platform</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/building-data-models-on-databricks-platform/m-p/126229#M2169</link>
      <description>&lt;P&gt;This post describes data models and how to build them on the Databricks Platform, focusing on Data Vault and Data Mesh.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Data Vault&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What is Data Vault?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Data Vault is a modern data modeling technique designed for agile, scalable, and auditable enterprise data warehouses. It separates core business concepts, their relationships, and descriptive attributes into distinct components.&lt;BR /&gt;It is a "write-optimized" modeling style.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753270638873.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18401i03FCECFC2CA18894/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_0-1753270638873.png" alt="rathorer_0-1753270638873.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Key Components:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Hub&lt;/STRONG&gt; – Stores unique business keys (e.g., CustomerID)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Link&lt;/STRONG&gt; – Stores relationships between hubs (e.g., Customer ↔ Order)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Satellite&lt;/STRONG&gt; – Stores descriptive attributes and history (e.g., Customer Name, Status)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Point-In-Time (PIT)&lt;/STRONG&gt; – Pre-joins hubs with their satellites so queries can apply "point in time" WHERE filters efficiently; Bridge tables similarly pre-join hubs and links to provide flattened, dimension-like views of entities.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Raw Vault&lt;/STRONG&gt; – Data is modeled as Hub, Link, and Satellite tables in the Raw Data Vault&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business Vault&lt;/STRONG&gt; – Created by applying the ETL business rules, data quality rules, and cleansing and conforming rules. 
It can serve as an enterprise "central repository" of standardized, cleansed data.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Why Use It?&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Scalability&lt;/STRONG&gt;: Hubs rarely change, which adds stability, while Satellites add flexibility because they can easily be extended for existing Hub keys; this makes the model scalable and audit-friendly.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Parallelism&lt;/STRONG&gt;: A high level of parallelism can be achieved because there are few dependencies between the tables of the model. For example, the hubs and satellites for customer, product, and order could all be loaded in parallel.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Canonical&lt;/STRONG&gt;: The Raw Vault preserves the source metadata, maintaining a single version of the facts.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Adaptability&lt;/STRONG&gt;: New Hubs or Satellites can be added to the model incrementally without massive ETL refactoring.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Databricks Alignment&lt;/STRONG&gt;: The Data Vault approach aligns well with the Lakehouse architecture.&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_15-1753272761286.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18419i09A6349F3D2C29E5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_15-1753272761286.png" alt="rathorer_15-1753272761286.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_16-1753272771387.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18420i1B5246F50229C516/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_16-1753272771387.png" alt="rathorer_16-1753272771387.png" 
/&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Design Considerations for Databricks:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Key Management: &lt;/STRONG&gt;Business keys for Hubs can be generated with hashing algorithms.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Staging Layer&lt;/STRONG&gt;: Data Vault requires a landing/staging zone, which acts as the Bronze layer of the medallion architecture.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Building Medallion Architecture for Raw &amp;amp; Business Vault with PIT Tables&lt;/STRONG&gt;: Hub, Satellite, and Link tables can be created in the Silver layer. Applying the business rules on top of them creates the Business Vault, which can serve as the central repository of the data platform.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DLT Alignment&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;Parallel processing for Hub and Satellite tables.&lt;/LI&gt;&lt;LI&gt;Satellite loads can be configured to capture any change.&lt;/LI&gt;&lt;LI&gt;Sequential dependencies for Links and the Business Vault.&lt;/LI&gt;&lt;LI&gt;Apply DQ checks to build the Business Vault layer.&lt;/LI&gt;&lt;LI&gt;PIT table creation can be orchestrated with materialized views.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_18-1753272935114.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18422i6318AB7B03349FAD/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_18-1753272935114.png" alt="rathorer_18-1753272935114.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Implementation of Data Vault on Databricks:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Bronze Layer: &lt;/STRONG&gt;The Bronze layer is created from ingested data. 
Complex structured data may need to be processed or flattened.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Raw Vault&lt;/STRONG&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;Create the Hub, Satellite, and Link tables here. This layer is the single source of truth for the data platform on Databricks.&lt;UL&gt;&lt;LI&gt;The key columns for this layer are created by applying a hashing function, for example: sha1(concat(UPPER(TRIM(c_name)),UPPER(TRIM(c_address)),UPPER(TRIM(c_phone)),UPPER(TRIM(c_mktsegment)))) as hash_diff&lt;/LI&gt;&lt;LI&gt;Add timestamp columns for auditing purposes.&lt;/LI&gt;&lt;LI&gt;DQ rules can be added on top of the Raw Vault tables, using DLT expectations, to populate the Business Vault.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Business Vault&lt;/STRONG&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;Apply the business/transformation logic on top of the Raw Vault.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Mart by PIT&lt;/STRONG&gt;&lt;STRONG&gt;: &lt;/STRONG&gt;This layer is consumed directly; refreshing it with incremental data sets, or building point-in-time views over the latest data, can be done by building materialized views on top of DLT.&lt;UL&gt;&lt;LI&gt;This acts as the typical denormalized table of dimensional modeling.&lt;/LI&gt;&lt;LI&gt;Complex join logic needed for consumption can be pre-built and stored in Delta tables.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;DLT Pipeline Setup&lt;/STRONG&gt;&lt;STRONG&gt;:&amp;nbsp;&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Set up orchestration for parallel/sequential processing.&lt;/LI&gt;&lt;LI&gt;Configure restartability to handle failures.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753273211716.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18423iADB0A9EEAC7C1879/image-size/medium?v=v2&amp;amp;px=400" role="button" 
title="rathorer_0-1753273211716.png" alt="rathorer_0-1753273211716.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Optimization Techniques:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;Use Delta-formatted tables for the Raw Vault, Business Vault, and Gold layer tables.&lt;/LI&gt;&lt;LI&gt;Make sure to use OPTIMIZE and Z-order indexes on all join keys of Hubs, Links, and Satellites.&lt;/LI&gt;&lt;LI&gt;Do not over-partition the tables, especially the smaller Satellite tables. Use Bloom filter indexing on date columns, current-flag columns, and predicate columns that are typically filtered on to ensure the best performance, especially if you need to create additional indices apart from Z-order.&lt;/LI&gt;&lt;LI&gt;Delta Live Tables (materialized views) make creating and managing PIT tables very easy.&lt;/LI&gt;&lt;LI&gt;Reduce optimize.maxFileSize to a lower number, such as 32-64 MB vs. the default of 1 GB. By creating smaller files, you can benefit from file pruning and minimize the I/O of retrieving the data you need to join.&lt;/LI&gt;&lt;LI&gt;The Data Vault model has comparatively more joins, so use the latest version of DBR, which ensures that Adaptive Query Execution is ON by default so that the best join strategy is used automatically. Use join hints only if necessary 
(for advanced performance tuning).&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Tech Layer Overview &amp;amp; Tech Segmentation:&lt;/STRONG&gt;&lt;/P&gt;&lt;TABLE width="985px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;&lt;STRONG&gt;Feature&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;&lt;STRONG&gt;Raw Vault&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;&lt;STRONG&gt;Business Vault&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Contains&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Hubs, Links, Satellites&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;PIT, Bridge, Derived views&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Tech&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Delta Live Tables (DLT)&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Views/MVs over DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Purpose&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Historical, auditable structure&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Business-friendly, performant querying&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Access&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Restricted via Unity Catalog&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Shared to consumers&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Change Rate&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Slowly changing&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Frequently updated (PITs)&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="326.76px"&gt;&lt;P&gt;Storage&lt;/P&gt;&lt;/TD&gt;&lt;TD width="308.792px"&gt;&lt;P&gt;Immutable&lt;/P&gt;&lt;/TD&gt;&lt;TD width="348.781px"&gt;&lt;P&gt;Derived, 
read-optimized&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;TABLE width="1230px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;&lt;STRONG&gt;Layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;&lt;STRONG&gt;Sub-Layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;&lt;STRONG&gt;Technology&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Bronze&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Landing Zone&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Raw ingestion&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;Autoloader + DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Silver&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Staging Zone&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Cleansed, deduped data&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Gold&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Raw Vault&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Hubs, Links, Satellites&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Business Vault&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;PIT, Bridge, business rules&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;Views/MVs on DLT&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Consumer&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Access Layer&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Denormalized, analytics-friendly&lt;/P&gt;&lt;/TD&gt;&lt;TD 
width="217.927px"&gt;&lt;P&gt;Unity Catalog Views&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="335.792px"&gt;&lt;P&gt;Storage&lt;/P&gt;&lt;/TD&gt;&lt;TD width="317.812px"&gt;&lt;P&gt;Immutable&lt;/P&gt;&lt;/TD&gt;&lt;TD width="357.802px"&gt;&lt;P&gt;Derived, read-optimized&lt;/P&gt;&lt;/TD&gt;&lt;TD width="217.927px"&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Data Mesh:&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What is Data Mesh:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;It's a democratized approach to managing data where various domains operationalize their data, relieving the Central Data/Analytics team from designing and developing data products. Instead, Central teams focus on providing and governing Data resources using a self-service platform.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_1-1753273720055.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18424i3ADC92475B6C0BDA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_1-1753273720055.png" alt="rathorer_1-1753273720055.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Data Mesh Principles:&lt;/STRONG&gt; &lt;TABLE width="1822px"&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;&lt;STRONG&gt;Principle&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;&lt;STRONG&gt;Description&lt;/STRONG&gt;&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Domain-oriented decentralized data ownership and architecture&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;the ecosystem creating and consuming data can scale out as the number of sources of data, number of use cases, and diversity of access models to the data 
increases; simply increase the autonomous nodes on the mesh.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Data as a product&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;data users can easily discover, understand and securely use high quality data with a delightful experience; data that is distributed across many domains.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Self-serve data infrastructure as a platform&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD width="363.885px"&gt;&lt;P&gt;Federated computational governance&lt;/P&gt;&lt;/TD&gt;&lt;TD width="1457.45px"&gt;&lt;P&gt;So that&amp;nbsp;data users can get value from aggregation and correlation of independent data products - the mesh is behaving as an ecosystem following global interoperability standards; standards that are baked computationally into the platform.&lt;/P&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Data Mesh Architecture Pattern:&amp;nbsp;&lt;/STRONG&gt;It is an architecture pattern where each functional data domain is represented as nodes and is interconnected, managed, and governed by a centralized IT/Governance node. 
Each data domain can host multiple data products that can be shared across different data domains using the same centralized IT/governance node.&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753283798836.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18436i8103CE3398EDA1AC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_0-1753283798836.png" alt="rathorer_0-1753283798836.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Data Product&lt;/STRONG&gt;:&amp;nbsp;&lt;/P&gt;&lt;P&gt;The domain team develops and exposes data products that provide access to the domain’s data in a consistent and consumable way.&lt;/P&gt;&lt;P&gt;A Data Product facilitates an end goal through the use of data.&lt;/P&gt;&lt;P&gt;Its objective is to provide this data in a clean, standardized, and proper way as a product to the other domain teams.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_1-1753283833115.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18437iF158A810D4BB9E70/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_1-1753283833115.png" alt="rathorer_1-1753283833115.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Build Data Mesh on Databricks:&amp;nbsp;&lt;/STRONG&gt;&lt;P&gt;Databricks provides many features, including data ingestion, data transformation, SQL, AI/ML, and many more, making it a complete unified data platform. It takes away the complexity of juggling multiple tools/services and the interoperability between them. 
This unified nature of Databricks makes it an ideal platform for implementing a Data Mesh architecture, which demands heterogeneous data types, use cases, and data delivery methods. Data Mesh principles can be mapped onto the design on Databricks.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Data Domain&lt;/STRONG&gt; &lt;STRONG&gt;&amp;amp;&lt;/STRONG&gt; &lt;STRONG&gt;Product&lt;/STRONG&gt; &lt;STRONG&gt;Platform&lt;/STRONG&gt; &lt;STRONG&gt;Building&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A common framework can be built which can be used to onboard various data domains.&lt;UL&gt;&lt;LI&gt;This platform can be made configurable for all data processing steps, including ingestion, data cleansing, and applying transformation/ETL logic at the domain level.&lt;/LI&gt;&lt;LI&gt;A DLT pipeline is a good architectural choice, as it provides a configurable approach for each step of data processing, including setting up DQ rules.&lt;/LI&gt;&lt;LI&gt;Each Data Product can have its own catalog. DLT can be configured for all these inputs and scaled across all the Data Products.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Separate data domains can be isolated by using separate workspaces. 
Since a domain can have multiple products, catalog-to-workspace binding is helpful, and access can be controlled at the domain/product level.&lt;/LI&gt;&lt;LI&gt;Data sharing between domains and with the hub can be controlled by Delta Sharing.&lt;/LI&gt;&lt;LI&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_2-1753284068176.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18439i719036D029D7E393/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_2-1753284068176.png" alt="rathorer_2-1753284068176.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Building Centralized Hub/ Self Service Platform&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Data products are published to the data hub, which owns and manages a majority of assets registered in Unity Catalog.&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_3-1753284162637.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18440i8F882A9998149EF5/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_3-1753284162637.png" alt="rathorer_3-1753284162637.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Federated Governance&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Data cataloging, lineage, audit, and access control via Unity Catalog.&lt;/LI&gt;&lt;LI&gt;Unity Catalog provides not only&amp;nbsp;&lt;I&gt;informational&lt;/I&gt;&amp;nbsp;cataloging capabilities such as data discovery and lineage, but also the&amp;nbsp;&lt;I&gt;enforcement&lt;/I&gt;&amp;nbsp;of fine-grained access controls and auditing.&lt;BR /&gt;&lt;span 
class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_4-1753284208418.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18441i8E29BEC2460FF999/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_4-1753284208418.png" alt="rathorer_4-1753284208418.png" /&gt;&lt;/span&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Self Service Data Layer:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;As part of the Data Mesh architecture, data can be fetched directly from its domain, since it is served as a Data Product. Internal/external users, systems, and apps can fetch the data published by the domain itself.&lt;/LI&gt;&lt;LI&gt;A proper governance model is required to restrict access.&lt;/LI&gt;&lt;LI&gt;Access control can be segregated for other-domain, internal, and external users.&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="rathorer_0-1753284389056.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18442i978A9C700BB72DD9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="rathorer_0-1753284389056.png" alt="rathorer_0-1753284389056.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;LI&gt;As shown, for external apps and users, it is advisable to follow the industry-recognized 3-tier architecture to isolate the front-end, back-end, and database tiers into different networks. A microservice-based architecture is recommended for better control and reusability. Microservices for a given piece of functionality, e.g., creating a workspace or repo, can be written using Databricks APIs and leveraged from the Web Tier.&lt;/LI&gt;&lt;LI&gt;Only the Web Tier can be accessed from the public internet. 
The API and DB Tiers are isolated from the internet.&lt;UL&gt;&lt;LI&gt;Web Tier: Front end for the self-service portal, accessed from the internet.&lt;/LI&gt;&lt;LI&gt;API Tier: Backbone of the self-service portal. Hosts microservices for the different Databricks APIs.&lt;/LI&gt;&lt;LI&gt;Metadata DB: Lean metadata layer storing functional details of a data domain and its data products, along with particulars about the accessibility of a data product and its consumers.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Central Analytics – Data Denormalization:&lt;/STRONG&gt;&lt;UL&gt;&lt;LI&gt;The Hub layer, once built, holds data from the different domains. This data is mostly dimension/metadata specific to a single domain.&lt;/LI&gt;&lt;LI&gt;Building the central analytics system requires ingesting the operational data separately.&lt;/LI&gt;&lt;LI&gt;This layer can be created separately on top of the Hub.&lt;/LI&gt;&lt;LI&gt;A separate schema/catalog and workspace can be created for this processing. The data is typically arranged in a denormalized way, with a Star/Snowflake schema on top of the dimension model.&lt;UL&gt;&lt;LI&gt;Creating a separate catalog might be required if this analytical layer serves the enterprise level. Setting up a separate pipeline (DLT in batch mode, or scheduling the job from a PySpark notebook) would serve the purpose here.&lt;/LI&gt;&lt;LI&gt;Creating a schema alone is sufficient if it just requires building another denormalized view of the data from the domains and the ingested transactional data set.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Implementing the typical approach of building SCD Type 2 on a few domain data sets would also be required; this can be configured if a DLT pipeline is used, or built in a PySpark notebook.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
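The Raw Vault section above derives keys with sha1(concat(UPPER(TRIM(c_name)), ...)). A minimal Python sketch of that normalize-then-hash pattern follows; the hash_diff helper name and the sample values are illustrative, and unlike Spark's concat this sketch does not model NULL propagation:

```python
import hashlib

def hash_diff(*columns):
    # Normalize each attribute (trim + uppercase), concatenate, and
    # SHA-1 the result, mirroring the post's
    # sha1(concat(UPPER(TRIM(col)), ...)) expression.
    normalized = "".join(str(c).strip().upper() for c in columns)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

# Two loads of the same customer row normalize to the same digest,
# so a Satellite load can skip the unchanged record.
row_v1 = hash_diff(" Acme Corp ", "12 Main St", "555-0101", "AUTOMOBILE")
row_v2 = hash_diff("acme corp", "12 MAIN ST ", "555-0101", "AUTOMOBILE")
assert row_v1 == row_v2

# A changed attribute yields a different digest, flagging a new
# Satellite version to insert.
row_v3 = hash_diff("Acme Corp", "99 Oak Ave", "555-0101", "AUTOMOBILE")
assert row_v1 != row_v3
```

The same comparison drives both Hub key generation (hash of the business key) and Satellite change detection (hash of the descriptive attributes), which is why the normalization step matters: without it, cosmetic differences in whitespace or case would register as changes.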
      <pubDate>Wed, 23 Jul 2025 15:34:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/building-data-models-on-databricks-platform/m-p/126229#M2169</guid>
      <dc:creator>rathorer</dc:creator>
      <dc:date>2025-07-23T15:34:18Z</dc:date>
    </item>
  </channel>
</rss>

