03-14-2026 03:45 AM
Hi Databricks Community!
I'm working on designing a practical Entity Resolution / Deduplication framework on Databricks, and I wanted to share a high-level blueprint and learn from others who've tackled similar problems at scale.
The use case is fairly common across Customer 360, KYC/AML onboarding, master data management, and data migrations: multiple sources, fragmented identities, and the need to reliably produce golden records while staying explainable and auditable.
Duplicate or slightly varied entity records (names, emails, phones, addresses) across systems lead to fragmented customer views, inflated entity counts, and downstream reporting and compliance problems.
Here's the pattern I've been following, intentionally kept modular and configurable so it can be reused across domains:
1. Normalization & validation
2. Blocking strategy
3. Fuzzy matching & scoring
4. Clustering
5. Golden record selection
6. ML-assisted matching
7. Human-in-the-loop review
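To make the first stage concrete, here is a minimal, stdlib-only sketch of normalization; the field names (`name`, `email`, `phone`) are illustrative, not a fixed schema, and a production pipeline would do this as Spark transformations instead:

```python
import re

def normalize_record(rec: dict) -> dict:
    """Standardise common identity fields before blocking/matching.
    Field names here are illustrative, not a fixed schema."""
    out = dict(rec)
    if rec.get("name"):
        # Lowercase, strip punctuation, collapse whitespace.
        out["name"] = re.sub(r"[^a-z0-9 ]", "", rec["name"].lower())
        out["name"] = re.sub(r"\s+", " ", out["name"]).strip()
    if rec.get("email"):
        out["email"] = rec["email"].strip().lower()
    if rec.get("phone"):
        # Keep digits only; real pipelines would also handle country codes.
        out["phone"] = re.sub(r"\D", "", rec["phone"])
    return out

print(normalize_record({"name": "  O'Brien,  JOHN ",
                        "email": " John@X.com ",
                        "phone": "+1 (555) 010-2030"}))
```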
The intent is not full automation but a balanced design that combines ML, rules, and supervised review, which is especially important in regulated environments.
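As a rough illustration of the fuzzy matching and clustering stages, here is a stdlib-only sketch: `difflib` similarity for pair scoring and union-find for transitive clustering. A real Databricks pipeline would use richer per-field scorers and distributed connected components; the threshold and IDs below are assumptions.

```python
from difflib import SequenceMatcher

def pair_score(a: str, b: str) -> float:
    # Ratcliff/Obershelp similarity from the stdlib; production pipelines
    # typically combine several per-field scores (Jaro-Winkler, token sets...).
    return SequenceMatcher(None, a, b).ratio()

def cluster(scored_pairs, threshold=0.85):
    """Union-find over record IDs: pairs scoring above the threshold
    end up in the same cluster (transitively)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for left, right, score in scored_pairs:
        if score >= threshold:
            parent[find(left)] = find(right)
    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

For example, `cluster([("a", "b", 0.9), ("b", "c", 0.9), ("x", "y", 0.9)])` groups `a`, `b`, `c` together even though `a` and `c` were never compared directly.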
If there's interest, I'm happy to follow up with a deeper breakdown (pseudo-code, table design, orchestration patterns, etc.).
Looking forward to learning from your experiences. Thanks in advance!
03-15-2026 02:01 PM
Hi @Mridu,
Rather than building everything greenfield, you can lean heavily on Databricks' existing Entity Resolution Solution Accelerators (Customer ER, Product Matching, Public Sector ER) and adapt their patterns to your blueprint, instead of maintaining a custom implementation end-to-end yourself. In previous projects I've done ER with a mix of third-party tools and custom code, which worked but was always clumsy and high-maintenance. The current Databricks accelerators and patterns are much more mature and provide a cleaner, more scalable starting point.
I've tried to map the readily available accelerators to your modular requirements.
On your specific question about blocking at scale, I would recommend starting with multiple composite keys (e.g. phonetic last name + postcode, email username + domain), plus a semantic/vector-based candidate retrieval step for messy text.
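To show what composite blocking keys look like in practice, here is a small sketch using American Soundex as the phonetic encoding (Spark exposes this natively as `pyspark.sql.functions.soundex`; the pure-Python version below is just to keep the example self-contained, and the field names are illustrative):

```python
# Soundex digit map: vowels/h/w/y are absent and act as separators or skips.
CODES = {**{c: "1" for c in "bfpv"}, **{c: "2" for c in "cgjkqsxz"},
         **{c: "3" for c in "dt"}, "l": "4",
         **{c: "5" for c in "mn"}, "r": "6"}

def soundex(name: str) -> str:
    """American Soundex: first letter + up to three digits, zero-padded."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return "0000"
    out, prev = [], CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h/w do not separate identical codes
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code
    return (letters[0].upper() + "".join(out) + "000")[:4]

def blocking_keys(rec: dict) -> list:
    """Emit (strategy, key) pairs; records sharing any key become candidates."""
    keys = []
    if rec.get("last_name") and rec.get("postcode"):
        keys.append(("phonetic_name_postcode",
                     soundex(rec["last_name"]) + "|" + rec["postcode"]))
    if rec.get("email") and "@" in rec["email"]:
        user, _, domain = rec["email"].lower().partition("@")
        keys.append(("email_user_domain", user + "|" + domain))
    return keys
```

Because each record can emit several keys, a miss on one strategy (say, a typo in the postcode) can still be caught by another, which is the point of combining multiple composite keys.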
In terms of storing reviewer decisions, you can model them as a labelled pair table in Delta (pair ID, entity IDs, label, reviewer, timestamp, key features). The accelerators and blogs use those labels both for ML training/evaluation and as an auditable system-of-record.
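As a sketch of that labelled pair record, here is one possible shape; the column names and label values are assumptions, and on Databricks you would land these rows in a Delta table (e.g. via `spark.createDataFrame(...).write.format("delta")`) rather than keeping them in memory:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class LabeledPair:
    # Column names are illustrative, not the accelerators' exact schema.
    pair_id: str
    left_entity_id: str
    right_entity_id: str
    label: str          # e.g. "match" / "no_match" / "unsure"
    reviewer: str
    reviewed_at: str    # ISO-8601 timestamp, for the audit trail
    features: dict      # key features shown to the reviewer, for explainability

row = asdict(LabeledPair(
    pair_id="p-001", left_entity_id="c-123", right_entity_id="c-456",
    label="match", reviewer="analyst_7",
    reviewed_at=datetime.now(timezone.utc).isoformat(),
    features={"name_sim": 0.93, "email_exact": True},
))
```

Keeping the features alongside the label is what makes the table double as training data and as an audit record of why each decision was made.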
If you haven't already, I'd start by cloning the Customer Entity Resolution accelerator and adapting each notebook step to your domain instead of rebuilding the framework from scratch.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
03-15-2026 10:43 PM
Hi @Ashwin_DSA, thank you for the detailed and thoughtful response; this is extremely helpful.
I completely agree with your point about not reinventing the wheel and leveraging the Databricks Entity Resolution Solution Accelerators as a starting point. The way you mapped the Customer ER / Product Matching / Public Sector ER accelerators to the modular stages I outlined is spot on and aligns very closely with the patterns I've seen as well.
My intent with this discussion was less about building a greenfield ER engine from scratch and more about validating the modular blueprint against what others have actually put into production. The points you called out are particularly useful and reinforce that the accelerators already encode many best practices.
I also like your recommendation of cloning the Customer Entity Resolution accelerator and adapting notebook-by-notebook, rather than maintaining a fully custom pipeline end-to-end; that's a very pragmatic approach and likely reduces long-term maintenance significantly.
Thanks again for taking the time to share this perspective. I'm sure this will be valuable for others exploring ER on Databricks as well.
Happy to mark this as the accepted solution!
03-15-2026 10:45 PM
Hi @Ashwin_DSA -
Thanks again for the detailed accelerator mapping. One follow-up question I'm curious about from a delivery perspective:
In regulated BFSI projects, how do teams typically decide where to draw the line between accelerator reuse vs custom extensions (e.g. survivorship rules, review workflows, or domain-specific features)?
Is there a common "80/20" split you've seen work well in production?
03-16-2026 02:40 AM
Hi @Mridu,
I haven't seen a clean blanket rule that holds across regulated BFSI, or across any domain, to be honest.
In my experience, the accelerators do exactly what the name implies: they give you a solid, opinionated starting point (data model, blocking, features, MLflow, clustering, evaluation), but they're not plug-and-play products. Every serious ER project has required a customer-specific build on top, including bank-specific survivorship rules, risk-based thresholds, integration with their MDM/CRM/case systems, and governance/audit flows for compliance.
So I'd think of it less as a fixed percentage and more as: use the accelerator to get from 0 to 1 quickly, then expect to invest in the last-mile work that's unique to your organisation and regulators. I've never been able to take an accelerator as-is and industrialise it straight into production.
Since your question relates specifically to BFSI, I can share experience from a couple of insurance clients, including one implementing a similar ER capability for sanctions compliance, where we had to ensure we didn't sell policies to, or conduct business with, sanctioned individuals or organisations. We couldn't simply use the accelerators as-is. My advice would therefore be: try not to alter the accelerator logic itself. Instead, treat it as a reference implementation and move all bank-specific behaviours into configuration tables, policies, and workflow layers around it.
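To illustrate that separation, here is one way survivorship could live in configuration rather than in the accelerator code; the source systems, fields, and priority order below are entirely illustrative, and in practice the config would sit in a governed Delta table, not a Python constant:

```python
# Per-field source priority, maintained as configuration so compliance-driven
# changes don't require touching pipeline code. All names are illustrative.
SURVIVORSHIP = {
    "email":   ["crm", "onboarding", "legacy"],
    "phone":   ["onboarding", "crm", "legacy"],
    "address": ["kyc", "crm", "legacy"],
}

def golden_record(cluster_records: list) -> dict:
    """Build a golden record by taking each field from the
    highest-priority source that actually has a value for it."""
    golden = {}
    for field, priority in SURVIVORSHIP.items():
        by_source = {r["source"]: r.get(field) for r in cluster_records}
        for source in priority:
            if by_source.get(source):
                golden[field] = by_source[source]
                break  # first non-empty value in priority order wins
    return golden
```

The benefit is exactly the one above: when a regulator or business owner changes a survivorship rule, you update a config row with its own audit history, instead of re-testing modified accelerator notebooks.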
Hope this helps.