
DQ Expectations Best Practice

ChristianRRL
Contributor III

Hi there, I hope this is a fairly simple and straightforward question. I'm wondering if there's a "general" consensus on where along the DLT data ingestion + transformation process data quality expectations should be applied. For example, two very simple kinds of expectations I can think of are:

  1. Check/Ensure data types are being set appropriately
  2. Check/Ensure data is unique (deduplicated)

In the above, is it fair to say that (1) data types should be checked on bronze/raw data (before the data is even ingested), or possibly in a bronze view? Or are these kinds of data type checks usually done on clean/silver data? As for (2) deduplication, I imagine this is more straightforward, since it would only make sense to apply it either in a bronze view (on top of the bronze table) or in the following silver table itself.
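
For concreteness, here is a minimal sketch of how both kinds of checks can be expressed in a DLT pipeline, with typing and deduplication handled in the silver table (all table, column, and path names are invented for illustration, not a prescribed layout):

```python
import dlt
from pyspark.sql import functions as F

# Bronze: land the raw data as-is (Auto Loader shown; the source path is illustrative).
@dlt.table(comment="Raw events, ingested as-is")
def events_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/events/")
    )

# Silver: cast types, deduplicate, and declare expectations on the cleaned output.
@dlt.table(comment="Typed, deduplicated events")
@dlt.expect_or_fail("event_id_not_null", "event_id IS NOT NULL")  # fail the update on missing keys
@dlt.expect_or_drop("valid_amount", "amount IS NOT NULL")         # drop rows whose cast produced NULL
def events_silver():
    return (
        dlt.read("events_bronze")
        .withColumn("amount", F.col("amount").cast("decimal(18,2)"))  # bad values become NULL
        .withColumn("event_ts", F.to_timestamp("event_ts"))
        .dropDuplicates(["event_id"])
    )
```

Note that expectations are row-level SQL constraints, so uniqueness can't be declared directly as an expectation; the dropDuplicates call does the actual dedup, and the expectations catch nulls and failed casts produced by typing the raw columns.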

2 REPLIES

ilarsen
Contributor

I'll offer my opinion.  I see both of those checks (and treatments, if you're converting types for example) as something for the clean/silver/staging/whatever-you-call-it layer.  For us, our bronze layer represents the source data as-is, with SCD type 2 behaviour to keep history.  The only deduplication we're doing in our bronze layer processing is during the structured streaming merge microbatch function.
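
To illustrate that last point, here is a rough sketch of per-micro-batch dedup inside a foreachBatch merge. This is the general Structured Streaming pattern rather than ilarsen's actual code: the table and column names are assumed, and the SCD type 2 history logic is simplified to a plain upsert.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def upsert_microbatch(batch_df, batch_id):
    # Keep only the latest record per key within this micro-batch so the MERGE
    # never sees two source rows for the same key ("business_key"/"ingest_ts" are assumed names).
    w = Window.partitionBy("business_key").orderBy(F.col("ingest_ts").desc())
    deduped = (
        batch_df.withColumn("_rn", F.row_number().over(w))
        .filter("_rn = 1")
        .drop("_rn")
    )
    (
        DeltaTable.forName(spark, "bronze.events").alias("t")
        .merge(deduped.alias("s"), "t.business_key = s.business_key")
        .whenMatchedUpdateAll()      # real SCD2 would instead close the old row and insert a new version
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.table("raw.events_stream")
    .writeStream
    .foreachBatch(upsert_microbatch)
    .option("checkpointLocation", "/tmp/checkpoints/events_bronze")  # illustrative path
    .start()
)
```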

joarobles
New Contributor III

I'll drop my two cents here: having validations at multiple layers reduces the effort needed to find the root cause of a data incident, but it comes with a drawback: they are harder to maintain.

Every layer has a set of rules to be enforced, and there will be assets that are more critical than others, so prioritization is key: discover which assets are consumed the most and apply validations there first.
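
As a sketch of what per-layer rule sets can look like natively in DLT (the rule names, constraints, and tables below are invented for illustration), shared expectation dictionaries can be reused across tables, with stricter enforcement reserved for the more critical assets:

```python
import dlt
from pyspark.sql import functions as F

# Per-layer rule sets, reusable across tables (all names/constraints are illustrative).
silver_rules = {
    "id_not_null": "id IS NOT NULL",
    "amount_is_numeric": "TRY_CAST(amount AS DECIMAL(18,2)) IS NOT NULL",
}
gold_rules = {
    "non_negative_amount": "amount >= 0",
}

@dlt.table
@dlt.expect_all_or_drop(silver_rules)   # critical asset: drop rows that violate any rule
def orders_silver():
    return dlt.read("orders_bronze")

@dlt.table
@dlt.expect_all(gold_rules)             # lower priority: warn only, metrics are still recorded
def orders_gold():
    return (
        dlt.read("orders_silver")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("amount"))
    )
```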

You could take a look at Rudol Data Quality, which has native Databricks integration and lets you create quality checks based on "policies" to validate multiple tables at once and propagate the validations across the whole lineage.

Have a high-quality week!
