10-02-2024 09:04 AM
Hi there,
I am trying to understand whether Databricks can support master data management (MDM) capabilities, in particular the following:
- Integrate and link different data systems: Connect various systems and make sure the data stays consistent across all of them (e.g., when a record is updated in one system, the change is automatically reflected in all connected systems)
- Manage data standardization rules: Establish and enforce rules to ensure data remains consistent across the organization (e.g., defining a standard format for date fields)
Any help or guidance would be highly appreciated.
Thanks a lot!
10-26-2024 11:33 PM - edited 10-26-2024 11:35 PM
Hi,
Databricks is designed for this and supports the features you asked about in the original post.
1) You can ingest data from various sources and in various formats, in both batch and streaming mode, through the Lakehouse architecture.
2) For consistent reads, data manipulation, and durability, please read about Delta Lake, an open storage format with ACID support.
3) Yes, you can enforce constraints and schemas, and use RBAC and Unity Catalog to centrally manage data governance, compliance, etc.
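As an illustration of the kind of standardization rule asked about (a standard format for date fields), here is a minimal plain-Python sketch. It is not Databricks-specific, and the list of accepted input formats and the function name are assumptions made up for the example. In practice the same logic could sit behind a table constraint or a pipeline step.

```python
from datetime import datetime

# Assumed set of input formats seen across source systems (hypothetical).
ACCEPTED_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def to_iso_date(value):
    """Return the date as an ISO 8601 string (YYYY-MM-DD), or None if no format matches."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next accepted format
    return None
```

Values that come back as None can be rejected or routed for review rather than silently stored in a non-standard format.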
Here is a reference link:
https://docs.databricks.com/en/ingestion/index.html
11-18-2024 10:47 AM
Databricks supports MDM in the same way that any off-the-shelf database can--you just have to write all the code to handle the data standardization, survivorship and entity resolution rules. You can absolutely do MDM in Databricks: the medallion architecture corresponds nicely to how traditional MDM systems categorize data, and the flexibility of Delta Live Tables pipelines makes it easy to write all the code you need. Streaming tables can be used for near-real-time MDM, while the scalability of the Spark/Photon compute means you can also handle gigantic batches of data.
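To give a sense of what "writing the survivorship rules yourself" means, here is a minimal plain-Python sketch. It assumes the records have already been matched to the same entity, and it assumes one particular rule (the most recently updated non-null value wins). The field names and sample records are hypothetical, and a real pipeline would express this in Spark or SQL rather than in-memory Python.

```python
from datetime import date

# Hypothetical source records already matched to a single entity.
records = [
    {"source": "crm", "updated": date(2024, 1, 5),
     "name": "Richard Roe", "phone": None, "email": "r.roe@example.com"},
    {"source": "billing", "updated": date(2024, 6, 1),
     "name": "Rich Roe", "phone": "555-0100", "email": None},
]

def build_golden_record(matched, fields):
    """Assumed survivorship rule: the most recently updated non-null value wins."""
    newest_first = sorted(matched, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in fields:
        # Take the first non-null value when scanning from newest to oldest.
        golden[field] = next(
            (r[field] for r in newest_first if r[field] is not None), None
        )
    return golden

golden = build_golden_record(records, ["name", "phone", "email"])
```

Other enterprises might prefer "most trusted source wins" or "most complete value wins"; the point is only that each such rule is code you own and maintain.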
You can import Python libraries to assist with your coding and use those libraries in your Delta Live Tables pipelines. But you are taking a "build first" approach. I've worked at places with that mentality and it's fine; you just have to maintain a lot of code (although you can get very customized MDM at potentially lower cost).
An advantage that Databricks has is the Databricks Marketplace. You can subscribe to services from Dun and Bradstreet or Experian which will do the standardization and entity resolution for you, then return the golden records. Another great Databricks feature is data expectations, which are used to measure data quality before any MDM work is done, and which can quarantine bad data to keep it from ruining your golden records. Both the Marketplace services and expectations are used in your Delta Live Tables (DLT) pipelines.
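The quarantine idea can be sketched in plain Python as follows. This is not the DLT expectations API, only an illustration of the pattern; the rule names, fields and sample rows are hypothetical.

```python
# Hypothetical data quality rules: rule name -> predicate on a record.
RULES = {
    "has_customer_id": lambda r: r.get("customer_id") is not None,
    "valid_email": lambda r: "@" in (r.get("email") or ""),
}

def split_by_quality(rows):
    """Return (clean, quarantined); quarantined rows carry the names of the failed rules."""
    clean, quarantined = [], []
    for row in rows:
        failed = [name for name, check in RULES.items() if not check(row)]
        if failed:
            quarantined.append({**row, "failed_rules": failed})
        else:
            clean.append(row)
    return clean, quarantined

rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": None, "email": "b@example.com"},
    {"customer_id": 3, "email": "not-an-email"},
]
clean, quarantined = split_by_quality(rows)
```

Only the clean rows would flow on to matching and survivorship; the quarantined rows keep enough context for someone to fix them at the source.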
I've built entirely custom MDM systems on SQL Server. They work, and they easily fit into your enterprise's service inventory, but they really require a dedicated team to maintain the system. MDM systems like Reltio also have a lesser impact on your enterprise, but they are SaaS, so all you do is maintain the rules. I describe systems like Informatica as "an enterprise lifestyle"--your business conforms to how they run.
In the end, all MDM systems cost money and take effort. The decision largely comes down to whether you want to pay your own developers and system engineers, or someone else's. All MDM systems have the same need for data quality, data governance and business rules governance, so those are the same across the board. It is possible for different companies to take different paths, and all be right.
02-17-2025 05:51 AM
Thank you for this insightful post! I really appreciate the detailed breakdown and the valuable perspective on MDM solutions on Databricks. The advancements in Databricks continue to be impressive!
I have a quick question regarding Master Data Management (MDM) - how do you see AI/Gen AI capabilities enhancing MDM within the Databricks ecosystem? Are there any best practices or recommended approaches for leveraging AI to improve data governance and entity resolution in MDM?
04-10-2025 01:04 PM
One capability AI might bring to MDM is name synonyms. One of the problems computers have is that "Rich" and "Richard" are different strings, but they are variations of the same name. This leads to a lot of false negatives when matching data from informal sources to data from formal sources. The same goes for Will/Bill for William, Bob/Rob for Robert, and Tom/Thom for Thomas. Where things get crazy is non-obvious names, such as Liz/Beth/Bess for Elizabeth, or Jack for John (ref: President Kennedy). An LLM, especially one which is very familiar with names, could make that part much easier.
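For comparison, the non-AI baseline is a static lookup table, sketched below using only the names mentioned above. The table and function names are hypothetical; the place an LLM might help is in building or extending such a table, or in handling names the table has never seen.

```python
# Tiny illustrative nickname table; a real one would be far larger.
NICKNAMES = {
    "rich": "richard",
    "will": "william", "bill": "william",
    "bob": "robert", "rob": "robert",
    "tom": "thomas", "thom": "thomas",
    "liz": "elizabeth", "beth": "elizabeth", "bess": "elizabeth",
    "jack": "john",
}

def canonical_first_name(name):
    """Map a nickname to its formal name; unknown names pass through lowercased."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)

def same_first_name(a, b):
    """True when both names resolve to the same canonical first name."""
    return canonical_first_name(a) == canonical_first_name(b)
```

With this in place, "Rich" and "Richard" compare as equal, which removes that particular class of false negatives before the matching rules run.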
Phone numbers and addresses conform to specific rules, and can be formatted with simple algorithms or by referencing a central database (like USPS CASS for addresses). Even the road name synonyms in the US are just reference data, no AI needed really.
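As an example of how simple such an algorithm can be, here is a minimal sketch that normalizes US phone numbers. It assumes 10-digit numbers with an optional leading country code 1, and the output format is an arbitrary choice for illustration.

```python
import re

def normalize_us_phone(raw):
    """Format a US number as (NNN) NNN-NNNN, or return None if it does not have 10 digits."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
```

Numbers that return None would be candidates for quarantine rather than for matching.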
MDM matching is very rules-driven, and not every enterprise will have the same rules or tolerance for false positives or false negatives. I can't see an LLM taking over the matching and data survivorship; they're just not built for that. After working with LangChain and LangGraph the last few days, I can see how you might be able to orchestrate tools and agents to replace a traditional MDM or business rules engine. But you're not really using any AI/Gen AI at that point, you're just using a slightly more flexible rules engine.
If a vendor came to me and said they have a totally AI powered MDM system, I would be extremely skeptical and somewhat nervous about how well it would work.
11-21-2024 11:51 AM
I failed to mention above that Databricks has several solution accelerators which support MDM/ER types of work. They are meant to be examples of how to do it, not to be used directly out of the box.
Customer Entity Resolution | Databricks
02-14-2025 02:25 PM
We at Frisco Analytics are a built-on partner for Databricks and have built an MDM application that is native to Databricks: https://www.lakefusion.ai/ Please contact us at contact@friscoanalytics.com.
Thank you!
Haritha