Databricks master data management capabilities
10-02-2024 09:04 AM
Hi there,
Please, I am trying to understand whether Databricks can support master data management (MDM) capabilities, particularly the following:
- Integrate and link different data systems: Connect various systems and make sure the data stays consistent across all of them (e.g., when a record is updated in one system, the change is automatically reflected in all connected systems)
- Manage data standardization rules: Establish and enforce rules to ensure data remains consistent across the organization (e.g., defining a standard format for date fields)
Please, any help or guidance is highly appreciated.
Thanks a lot!
10-26-2024 11:33 PM - edited 10-26-2024 11:35 PM
Hi,
Databricks is built for this and supports every feature you asked about in the OP.
1) You can ingest data from various sources, in both batch and streaming mode and in various formats, through the Lakehouse architecture.
2) For data consistency while reading, data manipulation, and durability, please read about Delta Lake, an open storage format with ACID support.
3) Yes, you can enforce constraints and schema, and use RBAC with Unity Catalog to centrally manage data governance, compliance, etc. (see the sketch below).
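For example, here is a minimal sketch of points 2 and 3: a Delta table with a CHECK constraint for a standardization rule, plus a MERGE to reflect updates from a connected system. The table and column names (main.mdm.customers, main.mdm.crm_updates) are hypothetical, and this assumes a Unity Catalog-enabled workspace where `spark` is available.

```python
# Sketch: enforce a standardization rule with a Delta CHECK constraint,
# then propagate changes from a source system with MERGE.
# Table/column names are hypothetical; run in a Databricks notebook.

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.mdm.customers (
        customer_id STRING,
        email       STRING,
        birth_date  DATE
    ) USING DELTA
""")

# Enforce a rule: birth_date must be a plausible value.
spark.sql("""
    ALTER TABLE main.mdm.customers
    ADD CONSTRAINT valid_birth_date CHECK (birth_date > '1900-01-01')
""")

# Reflect updates from a connected system into the master table.
spark.sql("""
    MERGE INTO main.mdm.customers AS target
    USING main.mdm.crm_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```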
Here are some reference links:
https://docs.databricks.com/en/ingestion/index.html
11-18-2024 10:47 AM
Databricks supports MDM in the way that any off-the-shelf database can: you just have to write all the code to handle the data standardization, survivorship, and entity resolution rules. You can absolutely do MDM in Databricks. The medallion architecture corresponds nicely to how traditional MDM systems categorize data, and the flexibility of Delta Live Tables pipelines makes it easy to write all the code you need. Streaming tables can be used for near-real-time MDM, while the scalability of Spark/Photon compute means you can also handle gigantic batches of data.
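To give a rough idea of how the medallion layers map onto MDM stages, here is a sketch of a Delta Live Tables pipeline; the source path and table names are hypothetical.

```python
# Sketch of a DLT pipeline mapping medallion layers to MDM stages:
# bronze = raw intake, silver = standardized records ready for matching.
# Source path and table names are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw records from source systems")
def customers_bronze():
    return (spark.readStream
                 .format("cloudFiles")          # Auto Loader for incremental ingest
                 .option("cloudFiles.format", "json")
                 .load("/Volumes/main/mdm/raw_customers"))

@dlt.table(comment="Silver: standardized records ready for matching")
def customers_silver():
    return (dlt.read_stream("customers_bronze")
               .withColumn("email", F.lower(F.trim("email")))
               .withColumn("birth_date", F.to_date("birth_date", "yyyy-MM-dd")))
```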
You can import Python libraries to assist with your coding and use those libraries in your Delta Live Tables pipelines. But you are "build first"; I've worked at places with that mentality and it's fine, you just have to maintain a lot of code (though you can get very customized MDM at potentially lower cost).
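As an illustration of the "import Python libraries" point, a fuzzy name-matching UDF can be built from the standard library alone; difflib is chosen here purely for simplicity, and real entity resolution would typically use a dedicated library. Names below are hypothetical.

```python
# Illustration: a fuzzy name-matching UDF built from Python's standard
# library, usable inside a DLT pipeline or a regular notebook.
from difflib import SequenceMatcher
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(DoubleType())
def name_similarity(a, b):
    # Return a 0.0-1.0 similarity score between two names.
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical usage: flag candidate duplicate pairs above a threshold.
# pairs = pairs_df.withColumn("score", name_similarity("name_a", "name_b"))
# candidates = pairs.filter("score > 0.85")
```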
An advantage that Databricks has is the Databricks Marketplace. You can subscribe to services from Dun and Bradstreet or Experian, which will do the standardization and entity resolution for you and then return the golden records. Another great feature is data expectations, which are used to measure data quality before any MDM work is done and can quarantine bad data to keep it from ruining your golden records. Both the Marketplace services and expectations are used in your DLT pipelines.
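Here is a sketch of that expectation-plus-quarantine pattern; the rule and table names are hypothetical, and it continues from the pipeline sketched above.

```python
# Sketch of DLT expectations with a quarantine table: rows passing the
# rule flow into the clean table; rows failing it are dropped there and
# captured separately for review. Rule and table names are hypothetical.
import dlt

RULE = "email IS NOT NULL AND email LIKE '%@%'"

@dlt.table(comment="Clean rows that pass the data quality rule")
@dlt.expect_or_drop("valid_email", RULE)
def customers_clean():
    return dlt.read_stream("customers_silver")

@dlt.table(comment="Quarantine: rows failing the rule, kept for review")
def customers_quarantine():
    return dlt.read_stream("customers_silver").filter(f"NOT ({RULE})")
```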
I've built entirely custom MDM systems on SQL Server. They work, and they fit easily into your enterprise's service inventory, but they really require a dedicated team to maintain. MDM systems like Reltio have less impact on your enterprise, but they are SaaS, so all you do is maintain the rules. I describe systems like Informatica as "an enterprise lifestyle": your business conforms to how they run.
In the end, all MDM systems cost money and take effort. The decision largely comes down to whether you want to pay your own developers and system engineers, or someone else's. All MDM systems have the same need for data quality, data governance and business rules governance, so those are the same across the board. It is possible for different companies to take different paths, and all be right.
yesterday
Thank you for this insightful post! I really appreciate the detailed breakdown and the valuable perspective on MDM solutions on Databricks. The advancements in Databricks continue to be impressive!
I have a quick question regarding Master Data Management (MDM) - how do you see AI/Gen AI capabilities enhancing MDM within the Databricks ecosystem? Are there any best practices or recommended approaches for leveraging AI to improve data governance and entity resolution in MDM?
11-21-2024 11:51 AM
I failed to mention above that Databricks has several solution accelerators which support MDM/ER types of work. They are meant as examples of how to do it, not to be used directly out of the box.
Customer Entity Resolution | Databricks
Friday
We at Frisco Analytics are a built-on partner for Databricks, and we have built an MDM application that is native to Databricks: https://www.lakefusion.ai/ Please contact us at contact@friscoanalytics.com.
Thank you!
Haritha

