<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/config-driven-data-harmonization-framework-in-databricks-silver/m-p/149449#M53106</link>
    <description>&lt;P&gt;Hi Community,&lt;/P&gt;&lt;P&gt;We are currently designing a &lt;STRONG&gt;Data Harmonization framework&lt;/STRONG&gt; in Databricks and would appreciate insights from anyone who has implemented something similar at scale.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Context&lt;/STRONG&gt;:&lt;BR /&gt;We are ingesting data from multiple source systems where:&lt;BR /&gt;- Different sources provide similar business objects&lt;BR /&gt;- Each source has different schemas and naming conventions&lt;BR /&gt;- Data types and formats vary&lt;BR /&gt;- There may be cross-source conflicts in attribute values&lt;/P&gt;&lt;P&gt;Harmonization will be implemented in the Silver Layer, creating a dedicated:&lt;BR /&gt;Silver → Harmonized_Silver Layer&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Key Design Considerations:&lt;/STRONG&gt;&lt;BR /&gt;We want the solution to be configuration-driven and reusable, not hardcoded per object.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The framework should support:&lt;/STRONG&gt;&lt;BR /&gt;- Data Harmonization from Different Source Systems&lt;BR /&gt;- Handle different objects across multiple sources&lt;BR /&gt;- Support schema variability&lt;BR /&gt;- Harmonization in Silver Layer&lt;BR /&gt;- Transform curated Silver data into a standardized Harmonized_Silver model&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Standard Harmonization Rules (Driven by Configuration)&lt;/STRONG&gt;&lt;BR /&gt;- Similar object merging&lt;BR /&gt;- Column mapping via metadata/config tables&lt;BR /&gt;- Schema standardization across source systems&lt;BR /&gt;- Data type and format normalization&lt;BR /&gt;- Enforce data quality rules&lt;BR /&gt;- Resolve cross-source conflicts (priority rules, survivorship logic, etc.)&lt;BR /&gt;- Maintain full auditability and lineage&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Create a Generic Data Harmonization Model&lt;/STRONG&gt;&lt;BR /&gt;We are aiming to design a reusable harmonization model that:&lt;BR /&gt;- Works across domains (Customer, Product, 
Order, etc.)&lt;BR /&gt;- Supports schema evolution&lt;BR /&gt;- Supports incremental loads&lt;BR /&gt;- Is scalable for large datasets (100M+ records)&lt;BR /&gt;- Maintains traceability to source systems&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions for the Community:&lt;/STRONG&gt;&lt;BR /&gt;- Has anyone implemented a similar config-driven harmonization model in Databricks?&lt;BR /&gt;- What architecture worked best (Delta Live Tables vs structured jobs/notebooks)?&lt;BR /&gt;- How did you handle cross-source conflict resolution logic at scale?&lt;BR /&gt;- What is the best approach for maintaining lineage and auditability (Unity Catalog, custom audit tables, etc.)?&lt;BR /&gt;- Any performance challenges or anti-patterns to avoid?&lt;/P&gt;&lt;P&gt;We are targeting an enterprise-grade design and would greatly appreciate any best practices, patterns, or lessons learned.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
    <pubDate>Fri, 27 Feb 2026 05:48:00 GMT</pubDate>
    <dc:creator>Vivek_Patil1</dc:creator>
    <dc:date>2026-02-27T05:48:00Z</dc:date>
    <item>
      <title>Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver)</title>
      <link>https://community.databricks.com/t5/data-engineering/config-driven-data-harmonization-framework-in-databricks-silver/m-p/149449#M53106</link>
      <description>&lt;P&gt;Hi Community,&lt;/P&gt;&lt;P&gt;We are currently designing a &lt;STRONG&gt;Data Harmonization framework&lt;/STRONG&gt; in Databricks and would appreciate insights from anyone who has implemented something similar at scale.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Context&lt;/STRONG&gt;:&lt;BR /&gt;We are ingesting data from multiple source systems where:&lt;BR /&gt;- Different sources provide similar business objects&lt;BR /&gt;- Each source has different schemas and naming conventions&lt;BR /&gt;- Data types and formats vary&lt;BR /&gt;- There may be cross-source conflicts in attribute values&lt;/P&gt;&lt;P&gt;Harmonization will be implemented in the Silver Layer, creating a dedicated:&lt;BR /&gt;Silver → Harmonized_Silver Layer&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Key Design Considerations:&lt;/STRONG&gt;&lt;BR /&gt;We want the solution to be configuration-driven and reusable, not hardcoded per object.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;The framework should support:&lt;/STRONG&gt;&lt;BR /&gt;- Data Harmonization from Different Source Systems&lt;BR /&gt;- Handle different objects across multiple sources&lt;BR /&gt;- Support schema variability&lt;BR /&gt;- Harmonization in Silver Layer&lt;BR /&gt;- Transform curated Silver data into a standardized Harmonized_Silver model&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Standard Harmonization Rules (Driven by Configuration)&lt;/STRONG&gt;&lt;BR /&gt;- Similar object merging&lt;BR /&gt;- Column mapping via metadata/config tables&lt;BR /&gt;- Schema standardization across source systems&lt;BR /&gt;- Data type and format normalization&lt;BR /&gt;- Enforce data quality rules&lt;BR /&gt;- Resolve cross-source conflicts (priority rules, survivorship logic, etc.)&lt;BR /&gt;- Maintain full auditability and lineage&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Create a Generic Data Harmonization Model&lt;/STRONG&gt;&lt;BR /&gt;We are aiming to design a reusable harmonization model that:&lt;BR /&gt;- Works across domains (Customer, Product, 
Order, etc.)&lt;BR /&gt;- Supports schema evolution&lt;BR /&gt;- Supports incremental loads&lt;BR /&gt;- Is scalable for large datasets (100M+ records)&lt;BR /&gt;- Maintains traceability to source systems&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Questions for the Community:&lt;/STRONG&gt;&lt;BR /&gt;- Has anyone implemented a similar config-driven harmonization model in Databricks?&lt;BR /&gt;- What architecture worked best (Delta Live Tables vs structured jobs/notebooks)?&lt;BR /&gt;- How did you handle cross-source conflict resolution logic at scale?&lt;BR /&gt;- What is the best approach for maintaining lineage and auditability (Unity Catalog, custom audit tables, etc.)?&lt;BR /&gt;- Any performance challenges or anti-patterns to avoid?&lt;/P&gt;&lt;P&gt;We are targeting an enterprise-grade design and would greatly appreciate any best practices, patterns, or lessons learned.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Feb 2026 05:48:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/config-driven-data-harmonization-framework-in-databricks-silver/m-p/149449#M53106</guid>
      <dc:creator>Vivek_Patil1</dc:creator>
      <dc:date>2026-02-27T05:48:00Z</dc:date>
    </item>
    <item>
      <title>Re: Config-Driven Data Harmonization Framework in Databricks (Silver → Harmonized_Silver)</title>
      <link>https://community.databricks.com/t5/data-engineering/config-driven-data-harmonization-framework-in-databricks-silver/m-p/150086#M53233</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/171303"&gt;@Vivek_Patil1&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Great question -- this is a pattern we see frequently in enterprise data platforms, especially in healthcare and financial services where multi-source harmonization is critical. Here is a comprehensive architecture recommendation using native Databricks capabilities.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;RECOMMENDED ARCHITECTURE: LAKEFLOW SPARK DECLARATIVE PIPELINES + CONFIG TABLES&lt;/P&gt;
&lt;P&gt;I recommend Lakeflow Spark Declarative Pipelines (SDP, formerly Delta Live Tables) as the backbone of your framework. SDP gives you declarative data quality, automatic dependency resolution, built-in lineage via Unity Catalog, and -- critically -- the ability to dynamically generate tables from configuration using Python metaprogramming. This is the key to making it config-driven rather than hardcoded.&lt;/P&gt;
&lt;P&gt;Here is how to structure it:&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;1. CONFIGURATION LAYER: DELTA TABLES AS YOUR FRAMEWORK METADATA&lt;/P&gt;
&lt;P&gt;Store your harmonization rules in Delta tables in Unity Catalog. This makes your config queryable, versionable (via Delta time travel), and accessible from pipeline code at runtime.&lt;/P&gt;
&lt;P&gt;harmonization_config.column_mappings -- maps source columns to harmonized target columns:&lt;/P&gt;
&lt;P&gt;source_system | source_object | source_column | target_object | target_column | data_type | transformation_expr | priority&lt;BR /&gt;salesforce | account | acct_name | customer | customer_name | STRING | TRIM(UPPER({col})) | 1&lt;BR /&gt;sap | kna1 | name1 | customer | customer_name | STRING | TRIM(UPPER({col})) | 2&lt;/P&gt;
&lt;P&gt;harmonization_config.quality_rules -- data quality expectations per target object:&lt;/P&gt;
&lt;P&gt;target_object | rule_name | rule_expression | action&lt;BR /&gt;customer | valid_customer_name | customer_name IS NOT NULL | drop&lt;BR /&gt;customer | valid_email_format | email RLIKE '^[^@]+@[^@]+$' | warn&lt;/P&gt;
&lt;P&gt;harmonization_config.survivorship_rules -- cross-source conflict resolution:&lt;/P&gt;
&lt;P&gt;target_object | target_column | resolution_strategy | priority_order&lt;BR /&gt;customer | customer_name | SOURCE_PRIORITY | salesforce,sap,oracle&lt;BR /&gt;customer | phone | MOST_RECENT | NULL&lt;BR /&gt;customer | email | MOST_COMPLETE | NULL&lt;/P&gt;
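&lt;P&gt;To make the SOURCE_PRIORITY strategy concrete, here is a minimal plain-Python sketch of how a priority_order string such as "salesforce,sap,oracle" could be interpreted. The helper names priority_ranks and pick_by_source_priority are hypothetical, and letting a null from the top source fall through to the next source is an assumption, not something the config table above prescribes.&lt;/P&gt;

```python
def priority_ranks(priority_order):
    """Map each source system to its rank (0 = highest priority)."""
    return {src.strip(): rank for rank, src in enumerate(priority_order.split(","))}


def pick_by_source_priority(values_by_source, priority_order):
    """Return the non-null value from the highest-priority source, or None."""
    ranks = priority_ranks(priority_order)
    candidates = [
        (ranks[src], value)
        for src, value in values_by_source.items()
        if src in ranks and value is not None  # assumption: nulls fall through
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])[1]


# Hypothetical conflicting values for customer_name of one customer
values = {"sap": "ACME GMBH", "oracle": "ACME", "salesforce": None}
winner = pick_by_source_priority(values, "salesforce,sap,oracle")
print(winner)
```

&lt;P&gt;In this example salesforce ranks first but holds no value, so the sap value is selected.&lt;/P&gt;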
&lt;P&gt;&lt;BR /&gt;2. DYNAMIC PIPELINE GENERATION WITH PYTHON METAPROGRAMMING&lt;/P&gt;
&lt;P&gt;This is the most powerful pattern for config-driven pipelines. SDP supports creating tables dynamically in Python for loops. Combined with pipeline parameters accessible via spark.conf.get(), you can build a fully generic framework.&lt;/P&gt;
&lt;P&gt;Here is the core pattern:&lt;/P&gt;
&lt;P&gt;from pyspark import pipelines as dp&lt;BR /&gt;from pyspark.sql import functions as F&lt;/P&gt;
&lt;P&gt;# Read config at pipeline initialization&lt;BR /&gt;config_df = spark.table("harmonization_config.column_mappings")&lt;BR /&gt;target_objects = [row.target_object for row in config_df.select("target_object").distinct().collect()]&lt;/P&gt;
&lt;P&gt;# Read quality rules; they are split into drop/warn dictionaries per target object below&lt;BR /&gt;quality_df = spark.table("harmonization_config.quality_rules")&lt;/P&gt;
&lt;P&gt;for target_obj in target_objects:&lt;BR /&gt;obj_mappings = config_df.filter(F.col("target_object") == target_obj).collect()&lt;/P&gt;
&lt;P&gt;# Build expectations dict for drop vs warn&lt;BR /&gt;drop_rules = {&lt;BR /&gt;row.rule_name: row.rule_expression&lt;BR /&gt;for row in quality_df.filter(&lt;BR /&gt;(F.col("target_object") == target_obj) &amp;amp; (F.col("action") == "drop")&lt;BR /&gt;).collect()&lt;BR /&gt;}&lt;BR /&gt;warn_rules = {&lt;BR /&gt;row.rule_name: row.rule_expression&lt;BR /&gt;for row in quality_df.filter(&lt;BR /&gt;(F.col("target_object") == target_obj) &amp;amp; (F.col("action") == "warn")&lt;BR /&gt;).collect()&lt;BR /&gt;}&lt;/P&gt;
&lt;P&gt;# CRITICAL: Use default parameter binding to avoid late-binding closure issues&lt;BR /&gt;@dp.materialized_view(&lt;BR /&gt;name=f"harmonized_silver.{target_obj}",&lt;BR /&gt;comment=f"Harmonized view of {target_obj} from multiple sources"&lt;BR /&gt;)&lt;BR /&gt;@dp.expect_all(warn_rules)&lt;BR /&gt;@dp.expect_all_or_drop(drop_rules)&lt;BR /&gt;def create_harmonized_table(&lt;BR /&gt;_target=target_obj,&lt;BR /&gt;_mappings=obj_mappings&lt;BR /&gt;):&lt;BR /&gt;source_dfs = []&lt;BR /&gt;for mapping_group in _group_by_source(_mappings):&lt;BR /&gt;source_sys = mapping_group[0].source_system&lt;BR /&gt;source_obj = mapping_group[0].source_object&lt;/P&gt;
&lt;P&gt;source_df = spark.table(f"silver.{source_sys}_{source_obj}")&lt;/P&gt;
&lt;P&gt;select_exprs = []&lt;BR /&gt;for m in mapping_group:&lt;BR /&gt;if m.transformation_expr:&lt;BR /&gt;expr = m.transformation_expr.replace("{col}", m.source_column)&lt;BR /&gt;select_exprs.append(F.expr(expr).cast(m.data_type).alias(m.target_column))&lt;BR /&gt;else:&lt;BR /&gt;select_exprs.append(F.col(m.source_column).cast(m.data_type).alias(m.target_column))&lt;/P&gt;
&lt;P&gt;select_exprs.append(F.lit(source_sys).alias("_source_system"))&lt;BR /&gt;select_exprs.append(F.current_timestamp().alias("_harmonized_at"))&lt;BR /&gt;source_dfs.append(source_df.select(*select_exprs))&lt;/P&gt;
&lt;P&gt;harmonized = source_dfs[0]&lt;BR /&gt;for df in source_dfs[1:]:&lt;BR /&gt;harmonized = harmonized.unionByName(df, allowMissingColumns=True)&lt;/P&gt;
&lt;P&gt;return harmonized&lt;/P&gt;
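&lt;P&gt;The listing above calls _group_by_source without defining it. A minimal sketch of such a helper follows, assuming it should return one list of mapping rows per (source_system, source_object) pair. The namedtuple stands in for the pyspark.sql.Row objects returned by collect(), and the acct_email column is a made-up example.&lt;/P&gt;

```python
from collections import defaultdict, namedtuple


def _group_by_source(mappings):
    """Group mapping rows by (source_system, source_object), in first-seen order."""
    groups = defaultdict(list)
    for m in mappings:
        groups[(m.source_system, m.source_object)].append(m)
    return list(groups.values())


# Stand-in for the Row objects produced by config_df.collect()
Mapping = namedtuple("Mapping", "source_system source_object source_column target_column")

rows = [
    Mapping("salesforce", "account", "acct_name", "customer_name"),
    Mapping("sap", "kna1", "name1", "customer_name"),
    Mapping("salesforce", "account", "acct_email", "email"),  # hypothetical column
]
grouped = _group_by_source(rows)
summary = [(g[0].source_system, len(g)) for g in grouped]
print(summary)
```

&lt;P&gt;Each group then maps to exactly one Silver table, which is what the inner loop of the pipeline function relies on.&lt;/P&gt;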
&lt;P&gt;IMPORTANT: Note the use of default parameter binding (_target=target_obj) in the function signature. This is required to avoid Python's late-binding closure issue where all loop iterations would reference the last value. This is explicitly documented in the Python development guide.&lt;/P&gt;
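&lt;P&gt;The late-binding behaviour can be reproduced in isolation with plain Python, no pipeline required. This is an illustrative sketch: the function bodies just return a table name, whereas in a pipeline they would return a DataFrame.&lt;/P&gt;

```python
targets = ["customer", "product", "order"]

# Without default binding: every function reads target_obj when it is CALLED,
# which happens after the loop has finished.
late_bound = []
for target_obj in targets:
    def make_table():
        return f"harmonized_silver.{target_obj}"
    late_bound.append(make_table)

# With default binding: the current value is captured when the function is DEFINED.
early_bound = []
for target_obj in targets:
    def make_table(_target=target_obj):
        return f"harmonized_silver.{_target}"
    early_bound.append(make_table)

late_names = [fn() for fn in late_bound]
early_names = [fn() for fn in early_bound]
print(late_names)
print(early_names)
```

&lt;P&gt;The first list contains harmonized_silver.order three times, while the second contains one distinct name per target.&lt;/P&gt;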
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/en/delta-live-tables/python-dev.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/python-dev.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;3. CROSS-SOURCE CONFLICT RESOLUTION&lt;/P&gt;
&lt;P&gt;For survivorship logic (choosing which source's value "wins" when they conflict), use a dedicated resolution layer as a second materialized view:&lt;/P&gt;
&lt;P&gt;@dp.materialized_view(name="harmonized_silver.customer_resolved")&lt;BR /&gt;def resolve_customer():&lt;BR /&gt;harmonized = spark.table("LIVE.harmonized_silver.customer")&lt;BR /&gt;survivorship = spark.table("harmonization_config.survivorship_rules")&lt;/P&gt;
&lt;P&gt;from pyspark.sql.window import Window&lt;/P&gt;
&lt;P&gt;# Expand priority_order ("salesforce,sap,oracle") into one row per source with its rank.&lt;BR /&gt;# Assumption: the customer_name rule supplies the record-level priority.&lt;BR /&gt;priority_df = (&lt;BR /&gt;survivorship&lt;BR /&gt;.filter((F.col("target_object") == "customer") &amp;amp; (F.col("target_column") == "customer_name"))&lt;BR /&gt;.select(F.posexplode(F.split("priority_order", ",")).alias("_source_priority", "_source_system"))&lt;BR /&gt;)&lt;/P&gt;
&lt;P&gt;# Rank records per entity key by source priority (0 = highest)&lt;BR /&gt;w = Window.partitionBy("customer_id").orderBy("_source_priority")&lt;/P&gt;
&lt;P&gt;return (&lt;BR /&gt;harmonized&lt;BR /&gt;.join(priority_df, on="_source_system")&lt;BR /&gt;.withColumn("_rank", F.row_number().over(w))&lt;BR /&gt;.filter(F.col("_rank") == 1)&lt;BR /&gt;.drop("_rank", "_source_priority")&lt;BR /&gt;)&lt;/P&gt;
&lt;P&gt;For more advanced survivorship (e.g., "take the most recent non-null value per column"), use COALESCE with window functions ordered by recency within each source.&lt;/P&gt;
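&lt;P&gt;The "most recent non-null value per column" rule can be sketched in plain Python to show the intended semantics. The _updated_at recency column and the sample records are assumptions for illustration. In Spark, a comparable result is typically obtained with a window ordered by recency and a null-skipping aggregate such as F.first(col, ignorenulls=True).&lt;/P&gt;

```python
from datetime import date


def most_recent_non_null(records, column, ts_column="_updated_at"):
    """Return `column` from the newest record in which it is not None."""
    candidates = [r for r in records if r.get(column) is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r[ts_column])[column]


# Hypothetical rows for a single customer_id after harmonization
records = [
    {"_source_system": "salesforce", "phone": None, "email": "a@example.com",
     "_updated_at": date(2026, 2, 1)},
    {"_source_system": "sap", "phone": "555-0100", "email": None,
     "_updated_at": date(2026, 1, 15)},
    {"_source_system": "oracle", "phone": "555-0199", "email": "old@example.com",
     "_updated_at": date(2025, 12, 1)},
]

resolved = {col: most_recent_non_null(records, col) for col in ("phone", "email")}
print(resolved)
```

&lt;P&gt;Here phone survives from sap and email from salesforce, so a single resolved record can combine values from several sources.&lt;/P&gt;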
&lt;P&gt;&lt;BR /&gt;4. DATA QUALITY WITH EXPECTATIONS&lt;/P&gt;
&lt;P&gt;SDP's expectations framework is perfect for config-driven quality rules. The expect_all, expect_all_or_drop, and expect_all_or_fail decorators accept Python dictionaries, so you can load them directly from your config tables:&lt;/P&gt;
&lt;P&gt;quality_rules = {&lt;BR /&gt;"valid_customer_name": "customer_name IS NOT NULL",&lt;BR /&gt;"valid_email": "email RLIKE '^[^@]+@[^@]+$'",&lt;BR /&gt;"valid_country_code": "country_code IN ('US','UK','DE','FR')"&lt;BR /&gt;}&lt;/P&gt;
&lt;P&gt;@dp.materialized_view(name="harmonized_silver.customer")&lt;BR /&gt;@dp.expect_all_or_drop(quality_rules)&lt;BR /&gt;def customer_harmonized():&lt;BR /&gt;...&lt;/P&gt;
&lt;P&gt;Quality metrics are automatically tracked in the pipeline UI and event log -- no custom audit tables needed for DQ monitoring.&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/en/delta-live-tables/expectations.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/expectations.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;5. LINEAGE AND AUDITABILITY&lt;/P&gt;
&lt;P&gt;Unity Catalog provides automatic table-level and column-level lineage tracking across your entire pipeline:&lt;/P&gt;
&lt;P&gt;- Table-level lineage: Automatically captured for all SDP pipeline operations&lt;BR /&gt;- Column-level lineage: Available on Databricks Runtime 13.3 LTS+&lt;BR /&gt;- Lineage retention: 1 year of history, visible across workspaces sharing the same metastore&lt;/P&gt;
&lt;P&gt;For additional audit tracking, add metadata columns to every harmonized table:&lt;/P&gt;
&lt;P&gt;.withColumn("_source_system", F.lit(source_system))&lt;BR /&gt;.withColumn("_source_table", F.lit(source_table))&lt;BR /&gt;.withColumn("_harmonized_at", F.current_timestamp())&lt;BR /&gt;.withColumn("_pipeline_id", F.lit(spark.conf.get("pipelines.id", "unknown")))&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/en/data-governance/unity-catalog/data-lineage.html" target="_blank"&gt;https://docs.databricks.com/en/data-governance/unity-catalog/data-lineage.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;6. SCHEMA EVOLUTION&lt;/P&gt;
&lt;P&gt;Delta Lake natively supports schema evolution:&lt;/P&gt;
&lt;P&gt;- Use mergeSchema for additive changes (new columns from source systems)&lt;BR /&gt;- SDP materialized views automatically handle schema changes on refresh&lt;BR /&gt;- For column renaming/dropping, enable column mapping on your Delta tables&lt;BR /&gt;- Store your expected schema in config tables and validate against it as a quality check&lt;/P&gt;
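&lt;P&gt;The last point, validating against an expected schema held in config, might look like the following plain-Python sketch. The schema_drift helper and the sample schemas are hypothetical. In a pipeline the actual schema could be taken from dict(df.dtypes), and the expected one from the data_type column of the mapping config.&lt;/P&gt;

```python
def schema_drift(expected, actual):
    """Compare two {column: type} dicts and report the differences."""
    shared = set(expected).intersection(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        # Type names are compared case-insensitively (STRING vs string)
        "type_mismatch": sorted(
            c for c in shared if expected[c].upper() != actual[c].upper()
        ),
    }


# Hypothetical expected schema from config vs. schema observed in Silver
expected = {"customer_name": "STRING", "email": "STRING", "created_at": "TIMESTAMP"}
actual = {"customer_name": "string", "email": "string",
          "created_at": "date", "loyalty_tier": "string"}

report = schema_drift(expected, actual)
print(report)
```

&lt;P&gt;A non-empty missing or type_mismatch list could then be surfaced as a failed quality check, while unexpected columns may simply indicate an additive change.&lt;/P&gt;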
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/en/delta/update-schema.html" target="_blank"&gt;https://docs.databricks.com/en/delta/update-schema.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;7. INCREMENTAL PROCESSING AT SCALE (100M+ RECORDS)&lt;/P&gt;
&lt;P&gt;For large-scale incremental loads:&lt;/P&gt;
&lt;P&gt;- Use streaming tables with spark.readStream for append-only ingestion from Silver into Harmonized_Silver&lt;BR /&gt;- Use AUTO CDC (formerly APPLY CHANGES INTO) for handling updates/deletes with SCD Type 1 or Type 2. This is especially useful for handling out-of-order events from multiple sources.&lt;BR /&gt;- For pure batch, materialized views can refresh incrementally and process only changed data when the defining query supports it; otherwise the refresh falls back to a full recompute&lt;/P&gt;
&lt;P&gt;Docs: &lt;A href="https://docs.databricks.com/en/delta-live-tables/cdc.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/cdc.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;8. SDP VS. STRUCTURED NOTEBOOKS&lt;/P&gt;
&lt;P&gt;To directly answer your architecture question:&lt;/P&gt;
&lt;P&gt;SDP (Recommended):&lt;BR /&gt;- Data quality: Built-in expectations with metrics&lt;BR /&gt;- Lineage: Automatic via Unity Catalog&lt;BR /&gt;- Dependency management: Automatic DAG resolution&lt;BR /&gt;- Dynamic table generation: Python metaprogramming in loops&lt;BR /&gt;- Schema evolution: Automatic handling&lt;BR /&gt;- Monitoring: Built-in pipeline UI&lt;BR /&gt;- Incremental processing: Native streaming + AUTO CDC&lt;/P&gt;
&lt;P&gt;Structured Notebooks:&lt;BR /&gt;- Data quality: Must build custom&lt;BR /&gt;- Lineage: Manual tracking&lt;BR /&gt;- Dependency management: Manual orchestration&lt;BR /&gt;- Dynamic table generation: Full flexibility&lt;BR /&gt;- Schema evolution: Manual&lt;BR /&gt;- Monitoring: Custom dashboards&lt;BR /&gt;- Incremental processing: Manual checkpointing&lt;/P&gt;
&lt;P&gt;SDP is the better choice for this use case because the declarative approach with automatic dependency resolution, built-in data quality expectations, and native lineage integration directly address your requirements.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;9. ANTI-PATTERNS TO AVOID&lt;/P&gt;
&lt;P&gt;1. Hardcoding transformations per source -- Use config tables and dynamic generation instead&lt;BR /&gt;2. Storing config in notebooks or JSON files -- Use Delta tables for versioning, querying, and sharing&lt;BR /&gt;3. Late-binding closures in Python loops -- Always use default parameter binding when creating tables in loops&lt;BR /&gt;4. Processing full datasets every run -- Use streaming tables or AUTO CDC for incremental processing&lt;BR /&gt;5. Ignoring data quality until Gold layer -- Apply expectations at Harmonized_Silver to catch issues early&lt;BR /&gt;6. Single monolithic pipeline -- Split by domain (Customer, Product, Order) for independent scaling and failure isolation&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;DOCUMENTATION REFERENCES&lt;/P&gt;
&lt;P&gt;- Medallion Architecture: &lt;A href="https://docs.databricks.com/en/lakehouse/medallion.html" target="_blank"&gt;https://docs.databricks.com/en/lakehouse/medallion.html&lt;/A&gt;&lt;BR /&gt;- Lakeflow Spark Declarative Pipelines: &lt;A href="https://docs.databricks.com/en/delta-live-tables/index.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/index.html&lt;/A&gt;&lt;BR /&gt;- Develop pipeline code with Python: &lt;A href="https://docs.databricks.com/en/delta-live-tables/python-dev.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/python-dev.html&lt;/A&gt;&lt;BR /&gt;- Use parameters with pipelines: &lt;A href="https://docs.databricks.com/en/delta-live-tables/parameters.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/parameters.html&lt;/A&gt;&lt;BR /&gt;- Manage data quality with expectations: &lt;A href="https://docs.databricks.com/en/delta-live-tables/expectations.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/expectations.html&lt;/A&gt;&lt;BR /&gt;- Change data capture with AUTO CDC: &lt;A href="https://docs.databricks.com/en/delta-live-tables/cdc.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/cdc.html&lt;/A&gt;&lt;BR /&gt;- Update Delta Lake table schema: &lt;A href="https://docs.databricks.com/en/delta/update-schema.html" target="_blank"&gt;https://docs.databricks.com/en/delta/update-schema.html&lt;/A&gt;&lt;BR /&gt;- Unity Catalog data lineage: &lt;A href="https://docs.databricks.com/en/data-governance/unity-catalog/data-lineage.html" target="_blank"&gt;https://docs.databricks.com/en/data-governance/unity-catalog/data-lineage.html&lt;/A&gt;&lt;BR /&gt;- Configure a pipeline: &lt;A href="https://docs.databricks.com/en/delta-live-tables/configure-pipeline.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/configure-pipeline.html&lt;/A&gt;&lt;BR /&gt;- Databricks Asset Bundles: &lt;A 
href="https://docs.databricks.com/en/dev-tools/bundles/index.html" target="_blank"&gt;https://docs.databricks.com/en/dev-tools/bundles/index.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Hope this helps -- happy to dive deeper into any specific aspect of this architecture!&lt;/P&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;</description>
      <pubDate>Sat, 07 Mar 2026 20:13:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/config-driven-data-harmonization-framework-in-databricks-silver/m-p/150086#M53233</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-07T20:13:45Z</dc:date>
    </item>
  </channel>
</rss>

