topic Re: How integrate unique PK expectation into LDP pipeline graph in Data Engineering

How integrate unique PK expectation into LDP pipeline graph

absan — Mon, 08 Dec 2025 16:37:47 GMT

Hi everyone,

I'm working on a LDP and need help ensuring a downstream table only runs if a primary key unique validation check passes. In something like dbt this is very easy to configure but with LDP it seems to require creating a separate view. Additionally, it's not clear to me how to integrate that extra step into the pipeline graph since checking uniqueness requires aggregating.

Setup:

-- Step 1: Bronze streaming table CREATE OR REFRESH STREAMING TABLE bronze_data AS SELECT * FROM STREAM read_files('path/to/data/*.parquet'); -- Step 2: Silver table with deduplication CREATE OR REFRESH MATERIALIZED VIEW silver_data AS SELECT * FROM bronze_data WHERE load_date = (SELECT MAX(load_date) FROM bronze_data); -- Step 3: Separate PK validation table CREATE OR REFRESH PRIVATE MATERIALIZED VIEW silver_data_pk_check( CONSTRAINT unique_pk EXPECT (record_count = 1) ON VIOLATION FAIL UPDATE ) AS SELECT primary_key, COUNT(*) as record_count FROM silver_data GROUP BY primary_key; -- Step 4: Gold/Analytics table CREATE OR REFRESH MATERIALIZED VIEW gold_data AS SELECT * FROM silver_data LEFT JOIN dimension_table ON ...;

I guess i could force the dependency with a cross join or where exists but the whole thing feels weird. Is there a pattern how to do this?

Re: How integrate unique PK expectation into LDP pipeline graph

Poorva21 — Mon, 08 Dec 2025 16:44:43 GMT

Yes — there is a clean, officially recommended pattern in LDP/Lakeflow, and it avoids all the “feels weird” hacks like CROSS JOIN, WHERE EXISTS, or artificial filters.

The correct pattern is:

Use a PRIVATE MV for the validation step + reference it (lightly) in the downstream MV

This creates a gating dependency without changing query semantics.

It is the Lakeflow equivalent of dbt tests gating downstream models.

Re: How integrate unique PK expectation into LDP pipeline graph

Hubert-Dudek — Mon, 08 Dec 2025 16:49:17 GMT

I know your solution is quite popular (just I don't get SELECT MAX(load_date) ). Another one is to use AUTO CDC even if you don't have CDC, as there is KEY option. If MAX(load_date) means that the last snapshot is most essential for you, please check the AUTO CDC from the SNAPSHOT.

Re: How integrate unique PK expectation into LDP pipeline graph

absan — Mon, 08 Dec 2025 17:41:46 GMT

Thank you for your reply. Is this officially recommended pattern documented somewhere?

Re: How integrate unique PK expectation into LDP pipeline graph

absan — Mon, 08 Dec 2025 17:56:18 GMT

Thank you for the suggestion @Hubert-Dudek. I explored both those options however my data does not have a consistent key column that would allow me to use AUTO CDCs.

Instead, I'm grabbing the entire batch that was last loaded and then doing a replace downstream using

INSERT INTO TABLE gold
REPLACE USING loadkey
SELECT * FROM silver

Unfortunately, this last step is not supported in LDP so i have to run it as a separate step in the job. But it is very simple and works well.