cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Delta Live Tables data quality rules application.

SRK
Contributor III

I have a requirement, where I need to apply inverse DQ rule on a table to track the invalid data. For which I can use the following approach:

import dlt

rules = {}

quarantine_rules = {}

rules["valid_website"] = "(Website IS NOT NULL)"

rules["valid_location"] = "(Location IS NOT NULL)"

# concatenate inverse rules

quarantine_rules["invalid_record"] = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(

 name="raw_farmers_market"

)

def get_farmers_market_data():

 return (

  spark.read.format('csv').option("header", "true")

   .load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')

 )

@dlt.table(

 name="valid_farmers_market"

)

@dlt.expect_all_or_drop(rules)

def get_valid_farmers_market():

 return (

  dlt.read("raw_farmers_market")

   .select("MarketName", "Website", "Location", "State",

       "Facebook", "Twitter", "Youtube", "Organic", "updateTime")

 )

@dlt.table(

 name="invalid_farmers_market"

)

@dlt.expect_all_or_drop(quarantine_rules)

def get_invalid_farmers_market():

 return (

  dlt.read("raw_farmers_market")

   .select("MarketName", "Website", "Location", "State",

       "Facebook", "Twitter", "Youtube", "Organic", "updateTime")

 )

However, when I store the invalid data in another table i.e., invalid_farmers_market. It will add all the rows which is invalid, but I am trying to apply following 2 rules.

rules["valid_website"] = "(Website IS NOT NULL)"

rules["valid_location"] = "(Location IS NOT NULL)"

I want to know is there is any way how I can understand the specific row is in invalid table because of which specific rule. Either by rules["valid_website"] or rules["valid_location"] or both. So that I can take appropriate action for the specific column.

5 REPLIES 5

daniel_sahal
Esteemed Contributor

@Swapnil Kamle​ 

I don't think there's a way of doing that out of the box.

IMO the best way would be to create a new Boolean columns (valid_website and valid_locaton) or create a view on top of the table that will have true/false indicator.

Thanks for the reply. I will check if that helps. ​

Hubert-Dudek
Esteemed Contributor III

You can get additional info from DLT event log which is in delta so you can load it as table https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-event-log.html#data-qualit...

Thanks for the reply. I will check, how this helps in my case. ​

jose_gonzalez
Databricks Employee
Databricks Employee

Hi @Swapnil Kamle​,

Just a friendly follow-up. Did any of the responses help you to resolve your question? if it did, please mark it as best. Otherwise, please let us know if you still need help.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group