Databricks Community

Ashwin_DSA · ‎05-18-2026

You've spent months planning the migration, the pipelines are built, and data is flowing into the new platform. Then comes the hard part: proving the data actually matches.

Business validation and reconciliation is where data migrations fail. Not because the work is technically impossible, but because it's slow, manual, and brittle. Keeping two systems in lockstep during a cutover is a hard problem on its own. A migration compounds it: you're not just keeping systems in sync, you're proving they match at scale across every business domain. Teams end up comparing row counts in spreadsheets, eyeballing aggregates across systems, and fielding the same question from stakeholders week after week: "Can we trust this data?"

I've done this many times across large-scale data warehouse migrations, and two stand out. One was at an insurance company migrating billions of records from a legacy data model, and another at a retailer moving terabytes from a legacy planning platform. Both taught me the same lesson: the teams that treat validation as an afterthought end up burning more time on reconciliation than on the migration itself.

Databricks ships Lakebridge, an open-source migration toolkit that handles code conversion and baseline reconciliation (row counts, schema, column-level checks). It's the right starting point for most migrations. This post is about what comes next: the business-specific validation, governance, and stakeholder-facing dashboards that every real migration ends up needing on top.

The five practices below are the patterns I'd build in from day one, and all of them leverage what Databricks SQL and the Lakehouse give you out of the box. Validation becomes part of the migration toolkit instead of a side project bolted on after cutover.

KEY TAKEAWAYS

Treat validation as a first-class part of the migration architecture, not a QA phase tacked on after cutover.
Build it in DBSQL from day one: SQL scripting for checks, Unity Catalog for lineage, AI/BI dashboards for visibility, Jobs for scheduling.
A reconciliation cycle that took six weeks of spreadsheet work now runs in under an hour on a schedule, with every stakeholder seeing results in real time.

What's in this post

The pain: why reconciliation drags
Standardize the validation framework with SQL scripting
Use EXCEPT and hash-based comparisons
Enable Unity Catalog for lineage and governance
Build a reconciliation dashboard with AI/BI
Automate and schedule validation runs
What this changes for the business
Validation is the migration, not a phase after it

The pain: why reconciliation drags

Before getting into the tips, let me paint the picture.

At the insurance company, we were migrating billions of records from a legacy data model to a modern data platform, and reconciliation had to happen at both the policy level and the claims level. Every module had its own validation approach: some teams wrote scripts, others manually compared exports, and the results landed in spreadsheets that were impossible to aggregate or audit. When a discrepancy surfaced, nobody could tell whether it was a data issue, a transformation bug, or a stale comparison.

At the retailer the setup was different but the outcome was similar. Terabytes of data were moving from a legacy planning platform to a new end-to-end solution, with Databricks as the migration engine. The platform choice was right, but the team didn't lean into it. Dozens of people worked across multiple modules, each writing their own profiling, validations, and reconciliation logic, with no shared catalog, no lineage, and no consistency. Results leaked into spreadsheets for stakeholders, some teams stood up Power BI just to prove row counts matched, and because Unity Catalog wasn't enabled, nobody could answer "who validated what" without a Slack archaeology dig.

Both projects got to the finish line, but reconciliation took far longer than it should have. Here's what I'd do differently in both cases.

Tip 1: Standardize the validation framework with SQL scripting

One of the biggest time sinks in both projects was inconsistency, because each team member had their own approach to validation. Some checked row counts, others checked aggregates, and a few did column-level comparisons, but nobody did all three in the same way.

Databricks SQL now supports SQL scripting, which gives you stored procedures with variables, loops, and conditional logic. This lets you build a reusable validation framework that every team member runs against every module.

Here's a simplified example of a parameterized reconciliation procedure:

CREATE OR REPLACE PROCEDURE validate_migration(
  source_table STRING,
  target_table STRING,
  key_columns ARRAY<STRING>,
  check_columns ARRAY<STRING>
)
LANGUAGE SQL
SQL SECURITY INVOKER
AS BEGIN
  DECLARE source_count BIGINT;
  DECLARE target_count BIGINT;

  -- Step 1: Row count comparison
  SET source_count = (SELECT COUNT(*) FROM IDENTIFIER(source_table));
  SET target_count = (SELECT COUNT(*) FROM IDENTIFIER(target_table));

  INSERT INTO migration_validation.results
  VALUES (source_table, target_table, 'row_count',
          source_count, target_count,
          source_count = target_count,
          CURRENT_TIMESTAMP());

  -- Step 2: Aggregate comparison for numeric columns
  FOR col AS SELECT EXPLODE(check_columns) AS col_name DO
    INSERT INTO migration_validation.results
    SELECT
      source_table, target_table,
      CONCAT('sum_', col.col_name),
      (SELECT SUM(IDENTIFIER(col.col_name)) FROM IDENTIFIER(source_table)),
      (SELECT SUM(IDENTIFIER(col.col_name)) FROM IDENTIFIER(target_table)),
      (SELECT SUM(IDENTIFIER(col.col_name)) FROM IDENTIFIER(source_table)) =
      (SELECT SUM(IDENTIFIER(col.col_name)) FROM IDENTIFIER(target_table)),
      CURRENT_TIMESTAMP();
  END FOR;
END;

The key insight is that validation logic should be code rather than a manual process. When it's a stored procedure, it's versioned, repeatable, and auditable; every team member calls the same procedure, every result lands in the same table, and the days of comparing screenshots in a meeting are over. At the insurance company, a standardized procedure like this would have eliminated weeks of back-and-forth on whether policy-level and claims-level validations were even comparable.

Tip 2: Use EXCEPT and hash-based comparisons for row-level mismatches

Row counts match and aggregates match, but are the rows actually identical? This is where most manual processes fall apart, because eyeballing samples doesn't scale to billions of records.

There's a temptation to stand up a separate data-quality system for this kind of check, but that moves validation away from where the data lives and adds another platform to operate. You don't need it: Databricks SQL gives you two approaches that work at scale.

Approach 1: EXCEPT for exact-match detection

-- Rows in source but not in target
SELECT * FROM source_policies
EXCEPT
SELECT * FROM target_policies;

-- Rows in target but not in source
SELECT * FROM target_policies
EXCEPT
SELECT * FROM source_policies;

Simple, readable, and handles schema-wide comparisons. If both queries return zero rows, the data matches exactly.

Approach 2: Hash-based comparison for large tables

When tables have billions of rows, computing a hash per row and comparing hashes is more efficient than a full EXCEPT:

-- Generate row-level hashes
CREATE OR REPLACE VIEW source_hashed AS
SELECT
  policy_id,
  MD5(CONCAT_WS('|',
    CAST(policy_id AS STRING),
    CAST(effective_date AS STRING),
    CAST(premium_amount AS STRING),
    CAST(status AS STRING)
  )) AS row_hash
FROM source_policies;

CREATE OR REPLACE VIEW target_hashed AS
SELECT
  policy_id,
  MD5(CONCAT_WS('|',
    CAST(policy_id AS STRING),
    CAST(effective_date AS STRING),
    CAST(premium_amount AS STRING),
    CAST(status AS STRING)
  )) AS row_hash
FROM target_policies;

-- Find mismatches
SELECT s.policy_id, s.row_hash AS source_hash, t.row_hash AS target_hash
FROM source_hashed s
FULL OUTER JOIN target_hashed t ON s.policy_id = t.policy_id
WHERE s.row_hash != t.row_hash
   OR s.policy_id IS NULL
   OR t.policy_id IS NULL;

The hash approach gives you three things that EXCEPT doesn't: you can identify which rows differ rather than just that they differ, you can join back to the source to see what changed, and you can store hashes for incremental comparison on subsequent runs. At the retailer, a single SQL-based pattern in DBSQL would have unified an approach that was otherwise fragmented across Python, notebooks, and external tools, and made results comparable across modules.

Tip 3: Enable Unity Catalog for lineage and governance from day one

This is the tip I wish I could go back in time and give the retailer's migration team.

When Unity Catalog is enabled, every table, view, and query in the migration workspace gets automatic lineage tracking. You can see which source tables feed which target tables, trace a discrepancy back to the transformation that caused it, and answer "who validated this table and when?" without digging through Slack messages. Without it, basic questions go unanswered: which source tables have been fully validated, which modules are blocked on upstream issues, and did someone already profile this column or do I need to start from scratch?

You can also tag tables with custom metadata to track validation status:

-- Tag tables with validation status using Unity Catalog tags
ALTER TABLE migration_validation.target_policies
SET TAGS ('validation_status' = 'passed', 'validated_by' = 'ashwin', 'validated_date' = '2026-04-10');

-- Query validation status across all tables
SELECT
  table_name,
  tag_name,
  tag_value
FROM system.information_schema.table_tags
WHERE schema_name = 'migration_validation'
  AND tag_name LIKE 'validation%';

Column-level lineage matters when debugging why a particular aggregate doesn't match: instead of tracing the transformation manually, you click through the lineage graph and see exactly which upstream columns contribute to the target. The result is full auditability of the migration with zero extra code.

Tip 4: Build a reconciliation dashboard with AI/BI dashboards

At both projects, validation results were consumed through spreadsheets, and at the retailer some teams went further and built Power BI reports just to show whether data matched. This is exactly what Databricks AI/BI dashboards solve, because they're native to the platform, query DBSQL directly, and require no additional infrastructure.

If validation results are already in a table (and after Tip 1, they are), building a migration health dashboard takes minutes:

Dataset: Validation summary

SELECT
  source_table,
  target_table,
  check_type,
  source_value,
  target_value,
  is_match,
  checked_at,
  CASE
    WHEN is_match THEN 'Passed'
    ELSE 'Failed'
  END AS status
FROM migration_validation.results
ORDER BY checked_at DESC;

From this single dataset, you can build a scorecard showing pass/fail rates across modules, a trend chart showing validation progress over time, a detail table filtered to failures so teams can drill into what needs attention, and module-level breakdowns so each workstream sees their own status.

Here's what it looks like in practice: a 60-day insurance data warehouse migration validating 82M rows across five modules.

*Figure 1. Migration health at a glance. 82M rows across 5 modules, 9,000 validation runs over 60 days. The claims crisis is visible in orange; green bars on the right show current module pass rates.*

The dashboard becomes the single source of truth for migration status. Stakeholders stop asking "are we on track?" in meetings because they can check the dashboard, and teams stop building their own reporting because the answers are already there. The dashboard queries the same tables your validation procedures write to, so there are no spreadsheets, no standalone BI tool, and no exports.

Tip 5: Automate and schedule validation runs with Databricks Jobs

Validation isn't a one-time activity, because data keeps flowing, transformations get updated, and what passed yesterday might fail today.

The final tip is to wrap validation procedures into Databricks Jobs and SQL task orchestration and schedule them on a cadence: daily during active migration, hourly during cutover.

-- Run all validations for a module
CALL validate_migration(
  'legacy.policies',
  'lakehouse.policies',
  ARRAY('policy_id'),
  ARRAY('premium_amount', 'coverage_amount')
);

CALL validate_migration(
  'legacy.claims',
  'lakehouse.claims',
  ARRAY('claim_id'),
  ARRAY('claim_amount', 'paid_amount')
);

Schedule these calls as a SQL task in a Databricks Job, and set up alerts on the validation results table so that when a check flips from pass to fail, the team gets notified immediately rather than at the next standup. This converts validation from a manual gate into a continuous process: at the insurance company, scheduled validation would have caught wave-to-wave regressions automatically rather than requiring a fresh manual cycle each time. The practical outcome of automating this loop is that a six-week reconciliation pass becomes a one-hour automated run, every hour.

This matters even more in incremental migrations, because most real-world data warehouse moves follow a Strangler Fig pattern, replacing one workload at a time while the legacy system keeps running. Scheduled validation is what keeps both systems honest with each other during that overlap.

The combination of SQL scripting (Tip 1) and scheduled jobs (Tip 5) gives you a fully automated validation pipeline, with results flowing into the dashboard from Tip 4.

What this changes for the business

Take one validation a CFO actually cares about: total written premium by line of business reconciling between legacy and lakehouse before sign-off.

Before this framework, the cycle was a SQL extract on the legacy system grouped by LOB exported to CSV, the same query on the new platform exported to a second CSV, both pulled into Excel for VLOOKUP and manual reconciliation, and the result emailed to the steerco. Two analysts, two to three hours each round, with sign-off blocked behind whoever owned the latest spreadsheet.

With the framework, the same check becomes a single procedure call, the result writes to validation_results with timestamp and validator, and the AI/BI dashboard tile reflects pass or fail per LOB live. The steerco opens the dashboard rather than chasing an attachment, and the reconciliation moves from a manual artefact to a governed, auditable record that anyone can re-run.

The check is the same. The cycle time collapses, the audit trail becomes automatic, and business stakeholders see status without anyone preparing it.

Validation is the migration, not a phase after it

Most data warehouse migrations don't fail in the pipelines, they stall in the proof. The teams that ship on time aren't the ones with better ETL; they're the ones who built the validation framework on day one, in the same platform they're migrating to.

So stop treating reconciliation as the thing you figure out after cutover, stop exporting row counts to spreadsheets, and stop standing up a separate BI tool just to show whether data matches. The five practices in this post (SQL scripting, hash comparisons, Unity Catalog lineage, AI/BI dashboards, and scheduled jobs) work as one system because they live on the same lakehouse. Validation writes to tables, lineage tracks them, dashboards read them, and jobs schedule the writes. The payoff isn't faster validation: it's a SQL migration where "can we trust this data?" is answered on a dashboard before anyone asks.

Ready to try this yourself? Start with a Databricks free trial and explore SQL scripting, Unity Catalog, and AI/BI dashboards firsthand.

Have questions or want to share how you handle validation in your migrations? Drop a comment below. I'd love to hear about it.

Databricks Community

Speed Up Data Warehouse Migration Validation

The pain: why reconciliation drags

Tip 1: Standardize the validation framework with SQL scripting

Tip 2: Use EXCEPT and hash-based comparisons for row-level mismatches

Approach 1: EXCEPT for exact-match detection

Approach 2: Hash-based comparison for large tables

Tip 3: Enable Unity Catalog for lineage and governance from day one

Tip 4: Build a reconciliation dashboard with AI/BI dashboards

Dataset: Validation summary

Tip 5: Automate and schedule validation runs with Databricks Jobs

What this changes for the business

Validation is the migration, not a phase after it

Metadata-Driven ETL Framework in Databricks (Part-1)

Top 10 query performance tuning tips for Databricks Serverless SQL

Best practices for safe data experimentation with Databricks