What Data Quality Framework do you use/recommend?
11-27-2023 03:47 PM
Hi guys,
In your opinion, what is the best Data Quality Framework (or technique)? Which one do you recommend?
07-25-2024 08:39 AM
Hi there!
You could also take a look at Rudol. It has native Databricks support and covers Data Quality validations and Data Governance. It also lets non-technical roles such as Business Analysts or Data Stewards take part in data quality, through no-code validations and integrations with everyday tools like Slack or Microsoft Teams.
Have a high-quality day!
06-18-2025 12:19 AM
There are many DQ tools and platforms, but most are SQL-based, which adds cost and delay. So it really depends on your use case and problem statement. Sometimes it makes sense to build your own, but most of the time it does not if it is meant to be used as a central service.
06-18-2025 02:19 AM
DQ is interesting. There are a lot of options in this space. Soda and Great Expectations both integrate fairly well with a Databricks setup.
I personally try to use dataframe abstractions for validation. We used deequ, which is very simple to use: you just pass your Spark dataframe to the code, and the validations run inside your Spark session if needed. Otherwise, we can decouple the DQ logic into separate classes in the package. I have spent some time working with it and wrote this blog post - https://datatribe.substack.com/p/deequ-an-open-source-data-quality
It's a DQ tool for data engineers, I would say. Interestingly, deequ returns its results as dataframes, so we can write them out as Delta tables to see the quality patterns. It is maintained by awslabs - https://github.com/awslabs/deequ
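To make the pattern concrete, here is a minimal, library-free sketch in plain Python of "declare constraints, run them over the data, get the results back as rows". It is not deequ's API: `Check`, `is_complete`, `is_unique`, `is_non_negative`, and `run_checks` are hypothetical names, and the sample rows are made up for illustration. In a real setup the result rows would be a Spark dataframe that you append to a Delta table.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


@dataclass
class Check:
    """A named constraint evaluated over all values of one column (hypothetical type)."""
    name: str
    column: str
    predicate: Callable[[List[Any]], bool]


def is_complete(column: str) -> Check:
    # Passes when the column contains no missing (None) values.
    return Check(f"is_complete({column})", column,
                 lambda vals: all(v is not None for v in vals))


def is_unique(column: str) -> Check:
    # Passes when no value appears more than once.
    return Check(f"is_unique({column})", column,
                 lambda vals: len(vals) == len(set(vals)))


def is_non_negative(column: str) -> Check:
    # Passes when every non-missing value is >= 0.
    return Check(f"is_non_negative({column})", column,
                 lambda vals: all(v is None or v >= 0 for v in vals))


def run_checks(rows: List[Row], checks: List[Check]) -> List[Row]:
    """Evaluate each check and return one result row per constraint."""
    results = []
    for check in checks:
        values = [row.get(check.column) for row in rows]
        status = "Success" if check.predicate(values) else "Failure"
        results.append({"constraint": check.name, "status": status})
    return results


# Made-up sample data: a duplicate id and a missing "views" value.
rows = [
    {"id": 1, "views": 10},
    {"id": 2, "views": 0},
    {"id": 2, "views": None},
]

results = run_checks(rows, [
    is_complete("id"),
    is_unique("id"),
    is_complete("views"),
    is_non_negative("views"),
])

for result in results:
    print(result)
```

Because the output is just rows of constraint name and status, it can be stored run after run and queried like any other table, which is what makes the quality patterns visible.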
In addition, I would like to use spark-expectations, open-sourced by Nike - https://github.com/Nike-Inc/spark-expectations