Get Started Discussions

Blueprint: Entity Resolution (Dedup & Golden Records) on Databricks — blocking, fuzzy scoring, MLflow

Mridu
New Contributor II

Hi Databricks Community 👋

I’m working on designing a practical Entity Resolution / Deduplication framework on Databricks, and I wanted to share a high‑level blueprint and learn from others who’ve tackled similar problems at scale.

The use case is fairly common across Customer 360, KYC/AML onboarding, master data management, and data migrations — multiple sources, fragmented identities, and the need to reliably produce golden records while still being explainable and auditable.

Problem

Duplicate or slightly varied entity records (names, emails, phones, addresses) across systems lead to:

  • Poor analytics and reporting
  • Compliance and audit risks
  • Heavy manual review effort that doesn’t scale

High‑level approach on Databricks

Here’s the pattern I’ve been following — intentionally kept modular and configurable so it can be reused across domains:

  1. Normalization & validation

    • Standardize names, emails, phones, and addresses
    • Persist curated Delta tables with lineage/audit metadata
  2. Blocking strategy

    • Reduce candidate pairs using lightweight keys (e.g., phonetic variants, email domain–style grouping)
    • Focus on scalability before expensive comparisons
  3. Fuzzy matching & scoring

    • Weighted similarity scores across name, email, address
    • Generate “potential match” pairs instead of binary decisions
  4. Clustering

    • Group related records using a connected‑components style approach to form entity clusters
  5. Golden record selection

    • Rule‑based selection driven by completeness, recency, and data quality signals
  6. ML‑assisted matching

    • Train a simple classifier on engineered similarity features
    • Register and manage the model with MLflow and use it to auto‑classify high‑confidence matches
  7. Human‑in‑the‑loop review

    • Low‑confidence matches routed for manual review
    • Reviewer decisions written back to Delta tables to support auditability and future retraining

The intent is not full automation, but a balanced design that combines ML, rules, and supervised review — especially important in regulated environments.
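As a rough illustration of step 3, the weighted multi-field scoring can be sketched in plain Python (field names and weights are illustrative; at scale this would run as a Spark UDF or via a dedicated matching library, not per-pair Python):

```python
from difflib import SequenceMatcher

# Illustrative field weights -- these would be tuned per domain.
WEIGHTS = {"name": 0.5, "email": 0.3, "address": 0.2}

def field_sim(a: str, b: str) -> float:
    """Normalised string similarity in [0, 1]; empty fields score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity across configured fields -> a 'potential match' score."""
    return sum(w * field_sim(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in WEIGHTS.items())

a = {"name": "Jon Smith", "email": "jon.smith@acme.com", "address": "12 High St"}
b = {"name": "John Smith", "email": "jsmith@acme.com", "address": "12 High Street"}
score = match_score(a, b)  # a graded score, not a binary match decision
```

The key point is that the output is a score per candidate pair, which the later thresholding and clustering steps consume.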


What I’d love input on from the community

  • What blocking strategies have worked well for you at scale, especially for name + address heavy datasets?
  • How do you typically decide thresholds between:
    • auto‑merge
    • manual review
    • no‑match
  • Any best practices for storing reviewer decisions so they’re useful for both audit and model improvement?
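For reference, the three-way banding I have in mind looks roughly like this (the thresholds are placeholders, to be tuned against labelled pairs):

```python
# Illustrative cut-offs; in practice calibrated on labelled data,
# e.g. so precision at the auto-merge threshold meets an agreed bar.
AUTO_MERGE = 0.90
NO_MATCH = 0.60

def band(score: float) -> str:
    """Route a pair score to auto-merge, manual review, or no-match."""
    if score >= AUTO_MERGE:
        return "auto_merge"
    if score < NO_MATCH:
        return "no_match"
    return "manual_review"
```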

If there’s interest, I’m happy to follow up with a deeper breakdown (pseudo‑code, table design, orchestration patterns, etc.).

Looking forward to learning from your experiences — thanks in advance! 🙌

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @Mridu,

Rather than building everything greenfield, you can lean heavily on Databricks’ existing Entity Resolution Solution Accelerators (Customer ER, Product Matching, Public Sector ER) as a starting point and adapt their patterns to your blueprint, instead of maintaining a custom implementation end‑to‑end. In my previous projects, I’ve done ER with a mix of third‑party tools and custom code, which worked but was always a bit clumsy and high‑maintenance. The current Databricks accelerators and patterns are much more mature and provide a cleaner, more scalable starting point.

I've tried to map the readily available accelerators to your modular requirements.

  • Normalization & validation --> In the Customer Entity Resolution accelerator, the first notebooks clean and standardise attributes (names, emails, phones, addresses) into curated Delta tables with audit‑friendly schemas.

  • Blocking strategy --> The same accelerator uses multiple blocking keys/candidate‑generation strategies (e.g., LSH/similarity search on text) to cut down comparisons at scale. The Product Matching and Public Sector ER accelerators follow the same pattern for products and people/orgs.

  • Fuzzy matching & scoring --> Customer ER + Product Matching show how to compute multi‑field similarity scores (string distance, embeddings, etc.) and assemble a single match score per pair.

  • Clustering --> They then group matched pairs into clusters/entities (e.g., using graph‑style connected‑components logic) to get from pairwise to entity‑level views.

  • Golden‑record selection --> Each accelerator includes examples of deterministic survivorship rules (recency, completeness, source priority) to pick the best record per cluster.

  • ML‑assisted matching --> Customer ER (with Zingg) and the Zingg‑based blogs train ML models on engineered similarity features, log them in MLflow, and use them to auto‑classify candidate pairs.

  • Human‑in‑the‑loop review --> The same pattern supports a "grey zone" of scores that go to human review. Reviewer decisions are written back to Delta and reused as labels for retraining and for audit trails.
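For the pairwise-to-cluster step specifically, the accelerators use graph-style connected components (e.g. via GraphFrames on Spark); the same logic can be sketched in pure Python with union-find, just to show what it does:

```python
def connected_components(pairs):
    """Group matched record-ID pairs into entity clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Matched pairs coming out of the scoring/threshold step:
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
# connected_components(pairs) -> two clusters: {r1, r2, r3} and {r4, r5}
```

At scale you would hand this to GraphFrames rather than a driver-side loop, but the pairwise-to-entity transition is the same.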

On your specific question about blocking at scale, I would recommend you start with multiple composite keys (e.g. phonetic last name + postcode, email username + domain, etc.), plus a semantic/vector‑based candidate retrieval step for messy text.
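A composite key along those lines can be sketched with a simplified Soundex-style phonetic code (illustrative only; a production pipeline would use a proper phonetic library and several keys in parallel):

```python
def phonetic_key(name: str) -> str:
    """Simplified Soundex-style code: first letter + up to 3 consonant digits."""
    table = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            table[ch] = digit
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return "0000"
    code = letters[0].upper()
    prev = table.get(letters[0], "")
    for c in letters[1:]:
        d = table.get(c, "")
        if d and d != prev:  # skip vowels and collapse adjacent duplicates
            code += d
        prev = d
    return (code + "000")[:4]

def blocking_key(last_name: str, postcode: str) -> str:
    """Composite block: phonetic last name + outward postcode prefix."""
    return f"{phonetic_key(last_name)}|{(postcode or '')[:3].upper()}"
```

Records only become candidate pairs when a blocking key collides, e.g. "Smith" and "Smyth" in the same postcode area land in the same block.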

In terms of storing reviewer decisions, you can model them as a labelled pair table in Delta (pair ID, entity IDs, label, reviewer, timestamp, key features). The accelerators and blogs use those labels both for ML training/evaluation and as an auditable system‑of‑record.
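A minimal shape for such a labelled pair row might look like this (column names are illustrative; on Databricks this would be appended to a Delta table):

```python
import json
from datetime import datetime, timezone

def review_decision(pair_id, left_id, right_id, label, reviewer, features):
    """Build one auditable reviewer-decision row (illustrative schema)."""
    assert label in {"match", "no_match", "unsure"}
    return {
        "pair_id": pair_id,
        "left_entity_id": left_id,
        "right_entity_id": right_id,
        "label": label,                          # doubles as the ML training label
        "reviewer": reviewer,                    # audit trail
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
        "features_json": json.dumps(features),   # snapshot of scores at review time
    }

row = review_decision("p-001", "r1", "r2", "match", "analyst_42",
                      {"name_sim": 0.93, "email_sim": 0.88})
```

Snapshotting the features alongside the label matters: it makes each decision reproducible for audit and directly usable for retraining, even after the upstream scoring logic changes.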

If you haven’t already, I’d start by cloning the Customer Entity Resolution accelerator and then adapting each notebook step to your domain instead of rebuilding the framework from scratch.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


4 REPLIES


Mridu
New Contributor II

Hi @Ashwin_DSA — thank you for the detailed and thoughtful response, this is extremely helpful.

I completely agree with your point about not reinventing the wheel and leveraging the Databricks Entity Resolution Solution Accelerators as a starting point. The way you mapped the Customer ER / Product Matching / Public Sector ER accelerators to the modular stages I outlined is spot on and aligns very closely with the patterns I’ve seen as well.

My intent with this discussion was less about building a greenfield ER engine from scratch and more about:

  • understanding the core architectural patterns behind ER on Databricks, and
  • identifying which parts are best handled by out‑of‑the‑box accelerators versus where teams typically extend or adapt them for domain‑specific needs (especially in regulated environments).

The points you called out around:

  • multi‑key + semantic blocking,
  • graph‑style clustering,
  • deterministic survivorship rules,
  • MLflow‑managed models (including Zingg),
  • and modelling reviewer decisions as a labelled Delta table for both audit and retraining

are particularly useful and reinforce that the accelerators already encode many best practices.

I also like your recommendation of cloning the Customer Entity Resolution accelerator and adapting notebook‑by‑notebook, rather than maintaining a fully custom pipeline end‑to‑end — that’s a very pragmatic approach and likely reduces long‑term maintenance significantly.

Thanks again for taking the time to share this perspective — I’m sure this will be valuable for others exploring ER on Databricks as well.

Happy to mark this as the accepted solution 👍

Mridu
New Contributor II

Hi @Ashwin_DSA -

Thanks again for the detailed accelerator mapping — one follow‑up question I’m curious about from a delivery perspective:

In regulated BFSI projects, how do teams typically decide where to draw the line between accelerator reuse vs custom extensions (e.g., survivorship rules, review workflows, or domain‑specific features)?

Is there a common “80/20” split you’ve seen work well in production?

Ashwin_DSA
Databricks Employee

Hi @Mridu,

I haven’t seen a clean blanket rule that holds across regulated BFSI, or across any domain, to be honest. 

In my experience, the accelerators do exactly what the name implies... they give you a solid, opinionated starting point (data model, blocking, features, MLflow, clustering, evaluation), but they’re not plug‑and‑play products. On every serious ER project, there has been a customer‑specific build on top, including bank‑specific survivorship rules, risk‑based thresholds, integration with their MDM/CRM/case systems, and governance/audit flows for compliance.

So I’d think of it less as a fixed percentage and more as... use the accelerator to get from 0 --> 1 quickly, then expect to invest in the last‑mile work that’s unique to your organisation and regulators. I’ve never been able to take an accelerator "as is" and industrialise it straight into production.

Since your question specifically relates to BFSI projects, I can share insights from a couple of insurance clients, including one implementing a similar ER solution for sanctions compliance, where we had to ensure we didn't sell policies to, or conduct business with, sanctioned individuals or organisations. We couldn't simply use the accelerators "as is". Therefore, my advice would be... try not to alter the accelerator logic. Instead, treat it as a reference implementation and move all bank‑specific behaviours into configuration tables, policies, and workflow layers around it.
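As a concrete sketch of that separation, survivorship priorities can live in configuration rather than in the accelerator notebooks (rule names and record fields below are illustrative; in practice the rules would be read from a config/Delta table):

```python
# Illustrative config-driven survivorship: changing bank-specific rules
# means editing configuration, not the reference implementation.
SURVIVORSHIP_RULES = [
    ("completeness", lambda r: sum(v is not None for v in r.values())),
    ("recency", lambda r: r.get("updated_at") or ""),
]

def golden_record(cluster):
    """Pick the surviving record per cluster by ranked, configured rules."""
    return max(cluster,
               key=lambda r: tuple(score(r) for _, score in SURVIVORSHIP_RULES))

cluster = [
    {"id": "r1", "email": None, "phone": "555", "updated_at": "2024-01-01"},
    {"id": "r2", "email": "a@b.com", "phone": "555", "updated_at": "2023-06-01"},
]
best = golden_record(cluster)  # r2 wins: more complete, recency is the tiebreaker
```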

Hope this helps.


Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***