a week ago
Hi Databricks Community 👋
I’m working on designing a practical Entity Resolution / Deduplication framework on Databricks, and I wanted to share a high‑level blueprint and learn from others who’ve tackled similar problems at scale.
The use case is fairly common across Customer 360, KYC/AML onboarding, master data management, and data migrations — multiple sources, fragmented identities, and the need to reliably produce golden records while still being explainable and auditable.
Duplicate or slightly varied entity records (names, emails, phones, addresses) across systems lead to fragmented customer views, inflated entity counts, and unreliable downstream analytics, reporting, and compliance checks.
Here’s the pattern I’ve been following — intentionally kept modular and configurable so it can be reused across domains:
1. Normalization & validation
2. Blocking strategy
3. Fuzzy matching & scoring
4. Clustering
5. Golden record selection
6. ML‑assisted matching
7. Human‑in‑the‑loop review
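To make the flow concrete, here's a deliberately simplified, pure‑Python sketch of how these stages compose. Everything in it is a placeholder assumption — the function names, the `difflib`-based scorer, the greedy clustering, and the completeness-based survivorship rule — not the actual framework; in practice each stage would be a configurable PySpark step.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def normalize(record):
    """Stage 1 — Normalization: lowercase and strip the matching fields."""
    return {k: (v or "").strip().lower() for k, v in record.items()}

def blocking_key(record):
    """Stage 2 — Blocking: cheap composite key so we only compare within a block."""
    domain = record["email"].split("@")[-1]
    return record["name"][:3] + "|" + domain

def score(a, b):
    """Stage 3 — Fuzzy scoring: simple name similarity in [0, 1]."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

def resolve(records, threshold=0.8):
    """Stages 4–5 — Cluster within blocks, then pick a golden record per cluster."""
    records = [normalize(r) for r in records]
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    golden = []
    for members in blocks.values():
        clusters = []
        for r in members:
            # Greedy clustering: join the first cluster whose seed scores high enough.
            for c in clusters:
                if score(r, c[0]) >= threshold:
                    c.append(r)
                    break
            else:
                clusters.append([r])
        for c in clusters:
            # Golden record selection: survive the most complete record.
            golden.append(max(c, key=lambda r: sum(1 for v in r.values() if v)))
    return golden
```

The point of the sketch is the seams: each stage takes the previous stage's output and can be swapped independently (e.g. replacing the scorer with an ML model in stage 6) without touching the others.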
The intent is not full automation, but a balanced design that combines ML, rules, and supervised review — especially important in regulated environments.
If there’s interest, I’m happy to follow up with a deeper breakdown (pseudo‑code, table design, orchestration patterns, etc.).
Looking forward to learning from your experiences — thanks in advance! 🙌
a week ago
Hi @Mridu,
Rather than building everything greenfield, you can lean heavily on Databricks' existing Entity Resolution Solution Accelerators (Customer ER, Product Matching, Public Sector ER) and adapt their patterns to your blueprint, instead of maintaining a custom implementation end‑to‑end yourself. In my previous projects, I've done ER with a mix of third‑party tools and custom code, which worked but was always clumsy and high‑maintenance. The current Databricks accelerators are much more mature and provide a cleaner, more scalable starting point.
I've tried to map the readily available accelerators to your modular requirements.
On your specific question about blocking at scale, I would recommend you start with multiple composite keys (e.g. phonetic last name + postcode, email username + domain, etc.), plus a semantic/vector‑based candidate retrieval step for messy text.
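To make those composite keys concrete, here's a minimal pure‑Python sketch. The simplified Soundex below is just for illustration (it skips some edge rules of the full algorithm); on Databricks you'd more likely use Spark SQL's built‑in `soundex()` function, and the field names are assumptions about your schema.

```python
def soundex(name):
    """Simplified Soundex: first letter + up to three digit codes."""
    if not name:
        return "0000"
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h/w do not break a run of equal codes
            prev = code
        if len(out) == 4:
            break
    return out.ljust(4, "0")

def blocking_keys(record):
    """Composite keys: phonetic surname + postcode, email user + domain."""
    user, _, domain = record["email"].lower().partition("@")
    return {
        "name_post": soundex(record["last_name"]) + "|" + record["postcode"],
        "email": user + "|" + domain,
    }
```

Each record then lands in several independent blocks, so a typo in one field (say, the postcode) doesn't prevent the pair from ever being compared via the email key.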
In terms of storing reviewer decisions, you can model them as a labelled pair table in Delta (pair ID, entity IDs, label, reviewer, timestamp, key features). The accelerators and blogs use those labels both for ML training/evaluation and as an auditable system‑of‑record.
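For illustration, here's a rough sketch of that labelled pair record as a plain Python structure — the field names and label values are my assumptions, not an accelerator schema. In practice you'd append rows shaped like this to a Delta table so the labels double as an audit trail.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class PairLabel:
    pair_id: str
    entity_id_a: str
    entity_id_b: str
    label: str        # e.g. "match" / "non_match" / "uncertain"
    reviewer: str
    reviewed_at: str  # ISO-8601 UTC timestamp, for auditability
    features: dict    # key features shown to the reviewer at decision time

def make_pair_label(entity_a, entity_b, label, reviewer, features):
    # Sort the entity IDs so the same pair always yields the same pair_id,
    # regardless of which side the reviewer saw first.
    a, b = sorted([entity_a, entity_b])
    return PairLabel(
        pair_id=f"{a}|{b}",
        entity_id_a=a,
        entity_id_b=b,
        label=label,
        reviewer=reviewer,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
        features=features,
    )
```

Freezing the dataclass and deriving `pair_id` deterministically keeps decisions immutable and idempotent, which is what makes the table usable both as ML training data and as a system of record.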
If you haven’t already, I’d start by cloning the Customer Entity Resolution accelerator and then adapting each notebook step to your domain instead of rebuilding the framework from scratch.
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
a week ago
Hi @Ashwin_DSA — thank you for the detailed and thoughtful response, this is extremely helpful.
I completely agree with your point about not reinventing the wheel and leveraging the Databricks Entity Resolution Solution Accelerators as a starting point. The way you mapped the Customer ER / Product Matching / Public Sector ER accelerators to the modular stages I outlined is spot on and aligns very closely with the patterns I’ve seen as well.
My intent with this discussion was less about building a greenfield ER engine from scratch and more about validating the modular blueprint itself and learning how others have adapted and industrialised these stages at scale.
The points you called out are particularly useful and reinforce that the accelerators already encode many best practices.
I also like your recommendation of cloning the Customer Entity Resolution accelerator and adapting notebook‑by‑notebook, rather than maintaining a fully custom pipeline end‑to‑end — that’s a very pragmatic approach and likely reduces long‑term maintenance significantly.
Thanks again for taking the time to share this perspective — I’m sure this will be valuable for others exploring ER on Databricks as well.
Happy to mark this as the accepted solution 👍
a week ago
Hi @Ashwin_DSA -
Thanks again for the detailed accelerator mapping — one follow‑up question I’m curious about from a delivery perspective:
In regulated BFSI projects, how do teams typically decide where to draw the line between accelerator reuse vs custom extensions (e.g., survivorship rules, review workflows, or domain‑specific features)?
Is there a common “80/20” split you’ve seen work well in production?
a week ago
Hi @Mridu,
I haven’t seen a clean blanket rule that holds across regulated BFSI, or across any domain, to be honest.
In my experience, the accelerators do exactly what the name implies... they give you a solid, opinionated starting point (data model, blocking, features, MLflow, clustering, evaluation), but they’re not plug‑and‑play products. On every serious ER project, there has been a customer‑specific build on top, including bank‑specific survivorship rules, risk‑based thresholds, integration with their MDM/CRM/case systems, and governance/audit flows for compliance.
So I’d think of it less as a fixed percentage and more as... use the accelerator to get from 0 --> 1 quickly, then expect to invest in the last‑mile work that’s unique to your organisation and regulators. I’ve never been able to take an accelerator "as is" and industrialise it straight into production.
Since your question specifically relates to BFSI projects, I can share insights from my experience with a couple of insurance clients, including one implementing a similar ER solution for sanctions compliance, where we had to ensure we didn't sell policies to, or conduct business with, sanctioned individuals or organisations. We couldn't simply use the accelerators "as is". So my advice would be... try not to alter the accelerator logic. Instead, treat it as a reference implementation and move all bank‑specific behaviours into configuration tables, policies, and workflow layers around it.
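To illustrate what "behaviour in configuration" can look like, here's a small sketch of survivorship driven by a config table rather than code changes. The rule names, config shape, and field names are all made up for the example; in a real project the config would live in a governed Delta table rather than a Python dict.

```python
# Bank-specific survivorship expressed as data, not code. Changing a rule
# for a new client means editing this config, not the pipeline.
SURVIVORSHIP_CONFIG = {
    "email": {"rule": "prefer_source", "source_priority": ["crm", "kyc", "web"]},
    "name":  {"rule": "longest"},
    "phone": {"rule": "most_recent"},
}

def survive(field, candidates, config=SURVIVORSHIP_CONFIG):
    """Pick the surviving value for one field of the golden record.

    candidates: list of dicts with keys: value, source, updated_at.
    """
    rule = config[field]["rule"]
    candidates = [c for c in candidates if c["value"]]
    if not candidates:
        return None
    if rule == "longest":
        return max(candidates, key=lambda c: len(c["value"]))["value"]
    if rule == "most_recent":
        return max(candidates, key=lambda c: c["updated_at"])["value"]
    if rule == "prefer_source":
        # Assumes every candidate's source appears in the priority list.
        prio = config[field]["source_priority"]
        return min(candidates, key=lambda c: prio.index(c["source"]))["value"]
    raise ValueError(f"unknown survivorship rule: {rule}")
```

The accelerator's golden-record step then just calls `survive()` per field, and the audit question "why did this value win?" is answered by the config table, not by reading code.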
Hope this helps.