cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
Goo
Databricks Employee
Databricks Employee

R1 is a leading provider of revenue management solutions for healthcare organizations, supporting hospitals, health systems, and physician groups across front-, middle-, and back-office revenue operations. Through its Phare Revenue Operating System, R1 combines AI, automation, analytics, and operational expertise to help providers improve financial performance, reduce administrative complexity, and accelerate revenue flow. Built around a platform approach, R1’s technology is designed to deliver real-time operational intelligence and workflow orchestration across the revenue cycle, from access and coding through claims management, denials, and reimbursement optimization.

  • 12 weeks.
  • 847 dbt-models migrated.
  • 35 Information Mart Tables.
  • Cost per run reduced by roughly 77%.

These were the results of R1's migration of its healthcare revenue cycle Data Vault from Snowflake to Databricks, led by R1's Zheng Zhu and delivered with Lovelytics.

 

The Revenue Cycle Pipelines

Healthcare revenue cycle analytics is a workload that must run multiple times a day across hundreds of billions of records, with the business watching the output. R1 runs a full Data Vault pipeline for these 847 dbt models across five layers, built on dbt Core, Delta Lake, and Unity Catalog on Databricks. 

The final Information Mart layer, the output consumed by R1's analytics and reporting, comprises 35 tables representing the distilled, business-ready view of the entire revenue cycle.

R1's mandate to Lovelytics was clear: deliver it on time, validate the code's behavior, and maintain synchronization with the current codebase to ensure continuity of operations. The migration was delivered in 12 weeks on a 2XL SQL Warehouse, achieving a 77% reduction in per-run DBU costs.

 

Why does this kind of migration fail when it fails?

Migrations slip by quarters when teams treat the work as a translation project rather than an engineering one. The SQL itself is rarely the hard part.

Structural differences between the platforms do most of the damage, and they surface in any sufficiently large warehouse. Below are challenges that R1 and Lovelytics had to address:

  • Continuity of operations: R1's upstream Snowflake repo continued to receive live changes during the migration, so the team had to keep production running while refactoring.
  • Inherited complexity: The team migrated and refactored simultaneously. Enterprise pipelines accumulate interdependencies over the years, some of which only surface under a different execution engine; NULL ordering defaults, implicit type coercion rules, timestamp/timezone handling, and window-function frame semantics are typical offenders. Finding and resolving those issues mid-migration, without breaking the codebase's expected behavior, is the bulk of the engineering work.
  • Non-portable hash functions: Snowflake's HASH() function is non-portable; no Databricks equivalent produces the same value, so production surrogate keys differ across platforms by design. The entire testing strategy had to be built around that.
  • Snowflake does not enforce unique merge keys: Delta MERGE requires at most one source row per target key, while Snowflake's MERGE accepts duplicates and resolves them non-deterministically. The fix, QUALIFY + ROW_NUMBER OVER (PARTITION BY key ORDER BY …) upstream of every merge, had to be applied across hundreds of models.
  • A maze of dependencies: R1's Information Mart Transient layer has 460 models with 735 internal dependency edges. Some models fan out to 100+ downstream consumers, so running it safely in parallel required generating explicit wave-ordered build selectors.
  • Three simultaneous data sources: Not all of R1's source data was in Unity Catalog on day one; the pipeline needed to read from Delta tables outside Unity Catalog, Delta tables inside Unity Catalog, and live federated Snowflake — simultaneously, without forking the codebase.
  • Federated Snowflake is the bottleneck: Predicate pushdown is poor, which is why the team loaded static copies into Delta rather than continuing to read via federation.

 

What R1 and the Lovelytics team did differently

Key architectural and technical decisions that mattered

The team treated data engineering like software engineering: tested the transformation logic itself, and built every validation layer as a first-class deliverable.

Most migration validation stops at row counts and aggregate comparisons, which confirms the data moved, but not that the logic behind it is correct. For this migration, each dbt model was unit tested in isolation with fixed inputs and asserted outputs, so the team could validate the logic independently of the current production data.

That distinction matters. A model can match Snowflake on today’s data and still contain logic that breaks when the data changes. Unit testing the transformation gives confidence that the model is correct, not just that it happens to agree with Snowflake on a specific run. For a healthcare revenue cycle pipeline running multiple times a day.

The design pattern used was a routing macro that resolved every cross-layer reference at runtime to one of three targets: Delta outside Unity Catalog, Delta inside Unity Catalog, or federated Snowflake. A dbt variable controlled the mode for each run, allowing the same codebase to execute against all three sources without maintaining separate branches.

Goo_0-1779290027537.png

dbt variables control which mode is active for each run. A code generation script refreshes the Jinja allowlist whenever the Unity Catalog source list changes. The pattern is simple: each model reference checks Unity Catalog and Databricks first, then falls back to federated Snowflake if the data hasn’t moved. As each source migrates, the same model automatically starts reading from Databricks without requiring code changes or branch-specific rewrites.

“For revenue cycle analytics, correctness is not theoretical—the business depends on these pipelines for mission-critical use cases. The challenge was preserving the expected behavior and our ability to iterate, while the business logic and implementation continued to evolve, and that the target platform introduced different execution semantics around hash generation, merge behavior, dependency ordering, and incremental processing. The migration succeeded because we treated rigorous model-level testing, runtime routing, wave-based dependency management, and reconciliation on the data concepts as core engineering work from day one—not as cleanup after the SQL was translated.” — Zheng Zhu, R1

Three hash key types, three strategies:

  • Snowflake HASH() surrogates: Not reproducible on Databricks. Replaced with MD5(CONCAT()); production keys differ by design.
  • SHA1_BINARY(CONCAT_WS()) paths in Business Vault links: These can achieve parity with UNHEX(SHA1()) on Databricks if the concatenated string is normalized identically — handled via a purpose-built macro.
  • Cross-environment reconciliation: Performed on natural business keys concatenated with a pipe delimiter, not on surrogate equality.

Surrogate keys were expected to differ across platforms; reconciliation could not rely on hash equality. The team validated against natural business keys instead.

Each layer had its own comparison notebook, with schema-driven column exclusions generated from INFORMATION_SCHEMA for fields like surrogate keys and load timestamps. The test harness shipped with the migration code.

On Databricks, the largest cost drivers were addressed through targeted rewrites: joining against narrow staging dimensions instead of wide mart tables, applying incremental predicates to restrict MERGE targets, and splitting a single wide model into parallel sub-models where the dependency graph allowed it.

 

  • Dual model trees, one repository. A full rewrite creates drift fast. To prevent that, Snowflake and Databricks model trees lived side by side in one repo, with workflow guardrails:
    • Two trees, one repo: differences stayed visible and reviewable.
    • CI lockstep: shared logic changes had to be updated in both implementations.
    • Shared tests: the same tests ran against either tree by model name.
    • No stale branch: the Databricks migration stayed aligned to the active Snowflake codebase.
  • Federation as a live comparison source. Snowflake stayed queryable from Databricks during cutover, giving the team a live baseline for validation.

    • Direct comparison: Databricks outputs could be compared against Snowflake without separate extract jobs.
    • Faster debugging: discrepancies could be traced to migration logic versus data movement timing.
    • Lower cutover risk: Snowflake remained available as a reference point while workloads moved.
  • Reference slice validation. To validate correctness without slowing delivery, the team used seed tables to run targeted information mart checks.
    • Focused scope: validation ran on a representative slice instead of the full dataset.
    • Fast reruns: smaller runs made frequent testing practical.
    • Real signal: known business keys still surfaced meaningful discrepancies before production.
  • Wave-ordered parallel builds. In layered vault architectures, unsafe execution order can create silent inconsistencies when downstream models read stale upstream state.
    • Dependency-aware waves: a script-generated explicit build selectors for the IM Transient layer.
    • Safe parallelism: models ran in parallel within each dependency wave.
    • No stale reads: downstream layers only ran after required upstream models completed.
    • Faster delivery: the team increased parallelism without sacrificing correctness.

 

What leaders should ask

If you're about to run a migration of this scale, there are a few questions worth asking your delivery team in week one:

  • How will you validate parity when surrogate keys differ by design?
  • What's your approach for merges where your legacy warehouse tolerates duplicate source rows?
  • How will the pipeline read from both platforms during cutover, without forking the codebase?
  • How are you sequencing datamart builds to maximize parallelism without breaking dependency order?
  • Are you testing the transformation logic, or just the output data?
  • How will you identify and handle anti-patterns that will surface under a different execution engine?

Cost optimization starts well before the bill arrives. Delivery risk starts well before the first failed validation run. In large-scale migrations, both are shaped by the decisions made before the first model moves: dependency mapping, execution strategy, validation design, data layout, and workload sizing.

The benchmark

Twelve weeks. 847 models migrated. Roughly 77% lower cost per run, with additional upside expected on Serverless.

R1’s Snowflake-to-Databricks migration succeeded because the team did the hard alignment work upfront: architecture, execution sequencing, validation, and engineering standards. That discipline made the effort more than a workload migration. It became a way to improve the operating model itself, with pipelines treated as production software: versioned, tested, measured, and continuously optimized.