cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Delta Live Tables data quality rules application.

SRK
Contributor III

I have a requirement, where I need to apply inverse DQ rule on a table to track the invalid data. For which I can use the following approach:

import dlt

rules = {}

quarantine_rules = {}

rules["valid_website"] = "(Website IS NOT NULL)"

rules["valid_location"] = "(Location IS NOT NULL)"

# concatenate inverse rules

quarantine_rules["invalid_record"] = "NOT({0})".format(" AND ".join(rules.values()))

@dlt.table(

 name="raw_farmers_market"

)

def get_farmers_market_data():

 return (

  spark.read.format('csv').option("header", "true")

   .load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')

 )

@dlt.table(

 name="valid_farmers_market"

)

@dlt.expect_all_or_drop(rules)

def get_valid_farmers_market():

 return (

  dlt.read("raw_farmers_market")

   .select("MarketName", "Website", "Location", "State",

       "Facebook", "Twitter", "Youtube", "Organic", "updateTime")

 )

@dlt.table(

 name="invalid_farmers_market"

)

@dlt.expect_all_or_drop(quarantine_rules)

def get_invalid_farmers_market():

 return (

  dlt.read("raw_farmers_market")

   .select("MarketName", "Website", "Location", "State",

       "Facebook", "Twitter", "Youtube", "Organic", "updateTime")

 )

However, when I store the invalid data in another table i.e., invalid_farmers_market. It will add all the rows which is invalid, but I am trying to apply following 2 rules.

rules["valid_website"] = "(Website IS NOT NULL)"

rules["valid_location"] = "(Location IS NOT NULL)"

I want to know is there is any way how I can understand the specific row is in invalid table because of which specific rule. Either by rules["valid_website"] or rules["valid_location"] or both. So that I can take appropriate action for the specific column.

5 REPLIES 5

daniel_sahal
Esteemed Contributor

@Swapnil Kamle​ 

I don't think there's a way of doing that out of the box.

IMO the best way would be to create a new Boolean columns (valid_website and valid_locaton) or create a view on top of the table that will have true/false indicator.

Thanks for the reply. I will check if that helps. ​

Hubert-Dudek
Esteemed Contributor III

You can get additional info from DLT event log which is in delta so you can load it as table https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-event-log.html#data-qualit...

Thanks for the reply. I will check, how this helps in my case. ​

Hi @Swapnil Kamle​,

Just a friendly follow-up. Did any of the responses help you to resolve your question? if it did, please mark it as best. Otherwise, please let us know if you still need help.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!