
how can I verify that the result of a dlt will have enough rows before updating the table?

yuinagam — Wed, 23 Jul 2025 11:30:16 GMT

I have a dlt/lakeflow pipeline that creates a table, and I need to make sure that it will only update the resulting materialized view if it will have more than one million records.

I've found this, but it seems to only work if I have already updated the table and then validate it afterwards with a separate job. This wouldn't work for me because I need to ensure that at no point the table has too few rows. When I tried it with a single pipeline (creating a temporary version of the table, verifying that temporary table, and creating the final table only if the test passed), I ran into a problem: `dlt.read("table_name").count()` always equals zero, even though once the table is created I can count its rows and get more.
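The intended flow described above (build a temporary result, verify it, publish only if the check passes) can be sketched in plain Python, independent of dlt. This is only an illustration of the gate logic: `publish_if_enough` and `MIN_ROWS` are hypothetical names, and plain lists stand in for tables.

```python
from typing import Sequence

MIN_ROWS = 1_000_000  # threshold taken from the question; adjust as needed


def publish_if_enough(staged: Sequence, published: Sequence,
                      min_rows: int = MIN_ROWS) -> Sequence:
    """Return the staged result only if it has more than min_rows rows.

    Otherwise the previously published result is returned unchanged,
    so the visible "table" never ends up with too few rows.
    """
    if len(staged) > min_rows:
        return staged
    return published


# Usage with a small threshold so the example stays cheap to run.
current = ["a", "b", "c"]
too_small = ["x"]
big_enough = ["p", "q", "r", "s"]

kept = publish_if_enough(too_small, current, min_rows=2)       # keeps `current`
replaced = publish_if_enough(big_enough, current, min_rows=2)  # takes `big_enough`
```

The key property is that the check runs against the staged result before anything replaces the published one.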

I've also tried just using `count(1)` in the `dlt.expect_or_fail` decorator but that always results in an error and doesn't seem to be supported.

More generally, how can I verify conditions that involve aggregations over the data in a dlt pipeline, and apply the update only if the verification succeeds?

Re: how can I verify that the result of a dlt will have enough rows before updating the table?

mariadawson — Wed, 23 Jul 2025 11:38:54 GMT

Currently, DLT doesn’t natively support applying expectations or conditional logic based on aggregate metrics like row count within a single pipeline step. That’s why using `count(1)` in `dlt.expect_or_fail` and counting rows in DLT tables from within the pipeline don’t work as expected.
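To illustrate why a row-level check cannot express a row-count condition, here is a minimal plain-Python sketch. It does not use the dlt API: `failing_rows` and `summarize` are hypothetical helpers that only mimic a predicate applied to one record at a time. A predicate of that kind can test the columns of each row but never the size of the dataset. The count becomes testable only after the data has been collapsed into a single summary row.

```python
from typing import Callable, Iterable, List, Mapping

Row = Mapping[str, int]


def failing_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> int:
    """Count rows that violate a row-level predicate.

    Each row is checked in isolation, so the predicate has no access
    to aggregates such as the total number of rows.
    """
    return sum(1 for row in rows if not predicate(row))


def summarize(rows: List[Row]) -> List[Row]:
    """Collapse a dataset into one summary row holding its row count."""
    return [{"row_count": len(rows)}]


rows = [{"id": i} for i in range(5)]

# Row-level check: fine for per-record conditions.
bad_ids = failing_rows(rows, lambda r: r["id"] >= 0)

# Aggregate check: only possible on the pre-aggregated summary row.
summary = summarize(rows)
too_few = failing_rows(summary, lambda r: r["row_count"] > 1_000_000)
```

Here `too_few` is 1 because the single summary row fails the threshold. Whether a failed check on such a summary dataset would also prevent the target table from being updated in a given pipeline is not established in this thread.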

Re: how can I verify that the result of a dlt will have enough rows before updating the table?

yuinagam — Wed, 23 Jul 2025 11:44:49 GMT

Thank you for the quick reply.

Is there a common/recommended/possible way to work around this limitation? I don't mind not using the expectation api if it doesn't support logic that's based on aggregations.