10-10-2022 01:49 AM
Hello,
I'm wondering if there is an option to make an expectation on DLT that will compare the number of records between two stages and e.g. fail if there is a difference between those counts?
I mean something like this:
@dlt.table()
def bronze():
Some transformations
@dlt.expect_or_fail("equal_number_of_records", "bronze_table.count() == silver_table.count()"
@dlt.table()
def silver():
Some transformations
10-10-2022 04:28 AM
What if you add a count("*") column to silver, and compare that to the count of bronze (which you put into a variable first)? Like that you compare a column to a scalar value, which I believe will work.
10-11-2022 07:09 AM
Unfortunately, it didnt work that way, expect don't see scalar values saved as variables
@dlt.expect_or_fail("equal_number_of_records", "qa_silver_row_count == bronze_count")
@dlt.table()
def silver():
bronze_count == bronze_table.count()
silver_table = # transformations on bronze table
silver_table = silver_table.withColumn("qa_silver_row_count", F.lit(silver_table.count()))
The way I manage to make it run is:
@dlt.expect_or_fail("equal_number_of_records", "qa_silver_row_count == qa_bronze_row_count")
@dlt.table()
def silver():
bronze_table = #load bronze table
silver_table = upstream_table.withColumn("qa_bronze_row_count", F.lit(bronze_table..count()))
silver_table = # transformations on bronze table
silver_table = silver_table.withColumn("qa_silver_row_count", F.lit(silver_table.count()))
It is a little bit cumbersome. It is a little annoying that DLT actually makes row count automatically but it can't be easily accessed. Maybe it is possible to get this data from the event log table I will try to find out
10-12-2022 04:29 AM
I dig into DLT docs and found this https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cookbook.html#validate-row... . I guess it solves my problem.
10-12-2022 04:39 AM
great find!
10-16-2022 09:28 AM
The problem is that expectations are deterministic, so SQL query in "expect" needs always to give the same result (for that reason, it is not possible to add that timestamp is from the last 60 minutes in "expect", but the timestamp may be greater than the hardcoded date). That's why it is proposed to create an additional table for checks instead of using "expect".
11-16-2022 10:15 PM
Hi @Jacek Dembowiak
Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.
We'd love to hear from you.
Thanks!
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group