12-08-2025 08:37 AM
Hi everyone,
I'm working on an LDP pipeline and need help ensuring a downstream table only runs if a primary-key uniqueness check passes. In something like dbt this is easy to configure, but with LDP it seems to require creating a separate view. It's also not clear to me how to integrate that extra step into the pipeline graph, since checking uniqueness requires aggregating.
Setup:
-- Step 1: Bronze streaming table
CREATE OR REFRESH STREAMING TABLE bronze_data AS
SELECT * FROM STREAM read_files('path/to/data/*.parquet');
-- Step 2: Silver table with deduplication
CREATE OR REFRESH MATERIALIZED VIEW silver_data AS
SELECT * FROM bronze_data
WHERE load_date = (SELECT MAX(load_date) FROM bronze_data);
-- Step 3: Separate PK validation table
CREATE OR REFRESH PRIVATE MATERIALIZED VIEW silver_data_pk_check(
CONSTRAINT unique_pk EXPECT (record_count = 1)
ON VIOLATION FAIL UPDATE
) AS
SELECT primary_key, COUNT(*) as record_count
FROM silver_data
GROUP BY primary_key;
-- Step 4: Gold/Analytics table
CREATE OR REFRESH MATERIALIZED VIEW gold_data AS
SELECT * FROM silver_data
LEFT JOIN dimension_table ON ...;
I guess I could force the dependency with a CROSS JOIN or WHERE EXISTS, but the whole thing feels weird. Is there a pattern for how to do this?
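For concreteness, the workaround I have in mind would rewrite Step 4 as something like this, referencing the check view purely to create the edge (which is exactly the part that feels hacky):
-- Hypothetical workaround sketch (dimension join from Step 4 elided):
-- the non-correlated EXISTS never filters rows out while the check view
-- has rows, but it makes LDP schedule silver_data_pk_check before gold_data.
CREATE OR REFRESH MATERIALIZED VIEW gold_data AS
SELECT *
FROM silver_data
WHERE EXISTS (SELECT 1 FROM silver_data_pk_check);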
12-08-2025 08:44 AM
Yes — there is a clean, officially recommended pattern in LDP/Lakeflow, and it avoids all the “feels weird” hacks like CROSS JOIN, WHERE EXISTS, or artificial filters.
The correct pattern is:
Use a PRIVATE MV for the validation step + reference it (lightly) in the downstream MV
This creates a gating dependency without changing query semantics.
It is the Lakeflow equivalent of dbt tests gating downstream models.
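A minimal sketch against your setup (assuming primary_key is the join key, and eliding the dimension join from your Step 4):
-- Because the unique_pk expectation guarantees record_count = 1 (and FAIL
-- UPDATE halts the whole update otherwise), this LEFT JOIN cannot duplicate
-- rows; it only adds the graph edge that forces the validation to run
-- before gold_data.
CREATE OR REFRESH MATERIALIZED VIEW gold_data AS
SELECT s.*
FROM silver_data s
LEFT JOIN silver_data_pk_check c
  ON s.primary_key = c.primary_key;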
12-08-2025 09:41 AM
Thank you for your reply. Is this officially recommended pattern documented somewhere?
12-08-2025 08:49 AM
I know your solution is quite popular (though I don't quite get the SELECT MAX(load_date) part). Another option is to use AUTO CDC even if you don't have CDC data, since there is a KEYS option. If MAX(load_date) means that the latest snapshot is what matters most to you, please check out AUTO CDC FROM SNAPSHOT.
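Roughly, the KEYS variant could look like this (the flow name and SCD type are just placeholders, and it assumes load_date can act as the sequencing column):
-- Sketch only: AUTO CDC flow keyed on primary_key, keeping the latest
-- row per key based on load_date (SCD TYPE 1 = no history kept).
CREATE OR REFRESH STREAMING TABLE silver_data;
CREATE FLOW silver_cdc AS AUTO CDC INTO silver_data
FROM STREAM bronze_data
KEYS (primary_key)
SEQUENCE BY load_date
STORED AS SCD TYPE 1;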
12-08-2025 09:56 AM
Thank you for the suggestion @Hubert-Dudek. I explored both of those options; however, my data does not have a consistent key column that would allow me to use AUTO CDC.
Instead, I'm grabbing the entire batch that was last loaded and then doing a replace downstream using:
INSERT INTO TABLE gold
REPLACE USING loadkey
SELECT * FROM silver
Unfortunately, this last step is not supported in LDP, so I have to run it as a separate step in the job. But it is very simple and works well.