
Delta Live Tables data quality rules application.

SRK
Contributor III

I have a requirement where I need to apply an inverse data quality (DQ) rule on a table to track the invalid data. For that I can use the following approach:

import dlt

rules = {}
quarantine_rules = {}

rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"

# concatenate the inverse of all rules into a single quarantine expectation
quarantine_rules["invalid_record"] = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(
    name="raw_farmers_market"
)
def get_farmers_market_data():
    return (
        spark.read.format("csv").option("header", "true")
        .load("/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/")
    )

@dlt.table(
    name="valid_farmers_market"
)
@dlt.expect_all_or_drop(rules)
def get_valid_farmers_market():
    return (
        dlt.read("raw_farmers_market")
        .select("MarketName", "Website", "Location", "State",
                "Facebook", "Twitter", "Youtube", "Organic", "updateTime")
    )

@dlt.table(
    name="invalid_farmers_market"
)
@dlt.expect_all_or_drop(quarantine_rules)
def get_invalid_farmers_market():
    return (
        dlt.read("raw_farmers_market")
        .select("MarketName", "Website", "Location", "State",
                "Facebook", "Twitter", "Youtube", "Organic", "updateTime")
    )

However, when I store the invalid data in another table, i.e. invalid_farmers_market, that table contains every row that is invalid, while I am applying the following two rules:

rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"

I want to know whether there is any way to tell which specific rule put a given row into the invalid table: valid_website, valid_location, or both. That way I can take the appropriate action for the specific column.

5 REPLIES

daniel_sahal
Honored Contributor III

@Swapnil Kamle

I don't think there's a way of doing that out of the box.

IMO the best way would be to create new Boolean columns (valid_website and valid_location), or to create a view on top of the table with a true/false indicator.
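A minimal sketch of that first idea, assuming a separate DLT table (the name flagged_farmers_market and the loop below are hypothetical, not from the original post) that evaluates each rule as its own Boolean column:

import dlt
from pyspark.sql import functions as F

rules = {
    "valid_website": "(Website IS NOT NULL)",
    "valid_location": "(Location IS NOT NULL)",
}

@dlt.table(
    name="flagged_farmers_market"  # hypothetical table name
)
def get_flagged_farmers_market():
    df = dlt.read("raw_farmers_market")
    # add one Boolean column per rule: valid_website, valid_location
    for rule_name, rule_expr in rules.items():
        df = df.withColumn(rule_name, F.expr(rule_expr))
    # keep only rows that break at least one rule
    return df.filter(~F.expr(" AND ".join(rules.values())))

Each quarantined row then carries its own valid_website / valid_location flags, so downstream logic can see exactly which rule (or both) it violated.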

Thanks for the reply. I will check if that helps.

Hubert-Dudek
Esteemed Contributor III

You can get additional information from the DLT event log, which is stored in Delta, so you can load it as a table: https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-event-log.html#data-qualit...
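A hedged sketch of reading that log, assuming the pipeline has a storage location configured (<storage-location> below is a placeholder to replace) and following the per-expectation metrics query pattern described on the linked page:

# placeholder: replace <storage-location> with the pipeline's configured storage location
event_log = spark.read.format("delta").load("<storage-location>/system/events")
event_log.createOrReplaceTempView("event_log_raw")

# aggregate passed/failed record counts per expectation and per dataset
per_rule_counts = spark.sql("""
  SELECT
    row_expectations.dataset AS dataset,
    row_expectations.name AS expectation,
    SUM(row_expectations.passed_records) AS passing_records,
    SUM(row_expectations.failed_records) AS failing_records
  FROM (
    SELECT explode(
      from_json(
        details:flow_progress.data_quality.expectations,
        "array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>"
      )
    ) AS row_expectations
    FROM event_log_raw
    WHERE event_type = 'flow_progress'
  )
  GROUP BY row_expectations.dataset, row_expectations.name
""")
per_rule_counts.show()

Note that this gives per-rule pass/fail counts (e.g. how many rows failed valid_website in each update), not a row-level mapping, so it complements rather than replaces the Boolean-column approach above.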

Thanks for the reply. I will check how this helps in my case.

Hi @Swapnil Kamle,

Just a friendly follow-up. Did any of the responses help you to resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.
