@Louis_Frolio 

Schema Validation Framework

We built a custom schema validation framework that operates at several levels:

Pre-commit validation hooks:

  • Integrated with our Git workflow
  • Automatically extracts schema changes from DDL scripts or notebook code
  • Flags high-risk changes (column removals, type changes) for additional review
  • Ensures schema change documentation exists
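
The high-risk flagging step above can be sketched roughly as follows; the patterns and function name are illustrative, not our actual hook implementation:

```python
import re

# DDL patterns we treat as high risk (illustrative, not exhaustive)
HIGH_RISK_PATTERNS = [
    re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    re.compile(r"\bALTER\s+COLUMN\s+\w+\s+TYPE\b", re.IGNORECASE),
]

def flag_high_risk(ddl_script: str) -> list[str]:
    """Return the statements in a DDL script that need additional review."""
    flagged = []
    for statement in ddl_script.split(";"):
        if any(p.search(statement) for p in HIGH_RISK_PATTERNS):
            flagged.append(statement.strip())
    return flagged

ddl = """
ALTER TABLE sales ADD COLUMN region STRING;
ALTER TABLE sales DROP COLUMN legacy_id;
"""
print(flag_high_risk(ddl))  # only the DROP COLUMN statement is flagged
```

A hook like this runs on the staged files and fails the commit (or requests extra review) when the returned list is non-empty.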

CI/CD pipeline validation:

  • Compares proposed schema with production schema
  • Classifies changes into risk categories (safe, moderate, high)
  • For high-risk changes, requires explicit approval signatures in metadata
  • Tests backward compatibility with sample queries
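
The risk classification could look something like this sketch. The dict-of-columns schema representation and the mapping of change types to risk levels are assumptions for illustration (a real implementation would compare Spark StructTypes):

```python
def classify_change(old_fields: dict, new_fields: dict) -> str:
    """Classify a proposed schema change as safe, moderate, or high risk.

    Schemas are represented as {column_name: type_string} dicts,
    a simplification of a Spark StructType.
    """
    removed = set(old_fields) - set(new_fields)
    retyped = {c for c in set(old_fields) & set(new_fields)
               if old_fields[c] != new_fields[c]}
    added = set(new_fields) - set(old_fields)

    if removed or retyped:   # column removals and type changes break readers
        return "high"
    if added:                # additive changes usually stay compatible
        return "moderate"
    return "safe"

print(classify_change({"id": "int"}, {"id": "int", "ts": "timestamp"}))  # moderate
```

In the pipeline, a "high" result gates the deployment until the approval signatures are present in the metadata.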

Tools and Implementation

The specific tools we use include:

  • Delta Lake's built-in schema utilities:

from delta.tables import DeltaTable

# Extract the current schema of the Delta table at table_path
current_schema = DeltaTable.forPath(spark, table_path).toDF().schema

  • Schema registry integration:

  1. We maintain a centralized schema registry built on a Delta table; it stores a record for each version of every schema used in our pipelines and tables
  2. All schema changes are recorded with metadata (who, when, why, approval status)
  3. Changes are versioned and linked to specific releases
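
A registry record along these lines captures the who/when/why metadata; the field names and helper are illustrative, and in practice the row would be appended to the registry Delta table:

```python
from datetime import datetime, timezone

def make_registry_record(table: str, schema_json: str, author: str,
                         reason: str, prior_version: int = 0) -> dict:
    """Build one schema-registry row (field names are illustrative)."""
    return {
        "table_name": table,
        "schema_version": prior_version + 1,
        "schema_json": schema_json,
        "changed_by": author,                                  # who
        "changed_at": datetime.now(timezone.utc).isoformat(),  # when
        "change_reason": reason,                               # why
        "approval_status": "pending",  # updated once reviewed
    }

record = make_registry_record("sales", '{"fields": []}', "lr", "add region column")
print(record["schema_version"])  # 1
```

Because the registry itself is a Delta table, its own history gives an audit trail on top of the explicit version column.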

  • Custom schema diff tooling:

  1. Compares schema versions and generates impact reports
  2. Uses Databricks Expectations framework for data validation after schema changes
  3. Automatically generates documentation of changes
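
The core of the diff tooling can be sketched like this, again using a simplified {column: type} representation rather than full Spark StructTypes:

```python
def schema_diff(old: dict, new: dict) -> dict:
    """Compare two schema versions and report what changed."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "type_changed": sorted(c for c in common if old[c] != new[c]),
    }

report = schema_diff(
    {"id": "int", "amount": "double", "legacy_id": "string"},
    {"id": "long", "amount": "double", "region": "string"},
)
print(report)
# {'added': ['region'], 'removed': ['legacy_id'], 'type_changed': ['id']}
```

The impact report and generated documentation are then rendered from this structure.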
LR
