01-23-2023 03:20 AM
I have a requirement, where I need to apply inverse DQ rule on a table to track the invalid data. For which I can use the following approach:
import dlt
rules = {}
quarantine_rules = {}
rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"
# concatenate inverse rules
quarantine_rules["invalid_record"] = "NOT({0})".format(" AND ".join(rules.values()))
@dlt.table(
name="raw_farmers_market"
)
def get_farmers_market_data():
return (
spark.read.format('csv').option("header", "true")
.load('/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/')
)
@dlt.table(
name="valid_farmers_market"
)
@dlt.expect_all_or_drop(rules)
def get_valid_farmers_market():
return (
dlt.read("raw_farmers_market")
.select("MarketName", "Website", "Location", "State",
"Facebook", "Twitter", "Youtube", "Organic", "updateTime")
)
@dlt.table(
name="invalid_farmers_market"
)
@dlt.expect_all_or_drop(quarantine_rules)
def get_invalid_farmers_market():
return (
dlt.read("raw_farmers_market")
.select("MarketName", "Website", "Location", "State",
"Facebook", "Twitter", "Youtube", "Organic", "updateTime")
)
However, when I store the invalid data in another table i.e., invalid_farmers_market. It will add all the rows which is invalid, but I am trying to apply following 2 rules.
rules["valid_website"] = "(Website IS NOT NULL)"
rules["valid_location"] = "(Location IS NOT NULL)"
I want to know is there is any way how I can understand the specific row is in invalid table because of which specific rule. Either by rules["valid_website"] or rules["valid_location"] or both. So that I can take appropriate action for the specific column.
01-23-2023 04:42 AM
@Swapnil Kamle
I don't think there's a way of doing that out of the box.
IMO the best way would be to create a new Boolean columns (valid_website and valid_locaton) or create a view on top of the table that will have true/false indicator.
01-23-2023 09:26 AM
Thanks for the reply. I will check if that helps.
01-23-2023 05:06 AM
You can get additional info from DLT event log which is in delta so you can load it as table https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-event-log.html#data-qualit...
01-23-2023 09:27 AM
Thanks for the reply. I will check, how this helps in my case.
02-24-2023 03:28 PM
Hi @Swapnil Kamle,
Just a friendly follow-up. Did any of the responses help you to resolve your question? if it did, please mark it as best. Otherwise, please let us know if you still need help.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group