Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to integrate a unique PK expectation into an LDP pipeline graph

absan
New Contributor III

Hi everyone,

I'm working on an LDP (Lakeflow Declarative Pipelines) pipeline and need help ensuring a downstream table only runs if a primary key uniqueness check passes. In something like dbt this is very easy to configure, but with LDP it seems to require creating a separate view. It's also not clear to me how to integrate that extra step into the pipeline graph, since checking uniqueness requires aggregating.

Setup: 

 
-- Step 1: Bronze streaming table
CREATE OR REFRESH STREAMING TABLE bronze_data AS
SELECT * FROM STREAM read_files('path/to/data/*.parquet');

-- Step 2: Silver table with deduplication
CREATE OR REFRESH MATERIALIZED VIEW silver_data AS
SELECT * FROM bronze_data
WHERE load_date = (SELECT MAX(load_date) FROM bronze_data);

-- Step 3: Separate PK validation table
CREATE OR REFRESH PRIVATE MATERIALIZED VIEW silver_data_pk_check(
  CONSTRAINT unique_pk EXPECT (record_count = 1)
  ON VIOLATION FAIL UPDATE
) AS 
SELECT primary_key, COUNT(*) as record_count
FROM silver_data
GROUP BY primary_key;

-- Step 4: Gold/Analytics table
CREATE OR REFRESH MATERIALIZED VIEW gold_data AS
SELECT * FROM silver_data
LEFT JOIN dimension_table ON ...;

I guess I could force the dependency with a cross join or where exists, but the whole thing feels weird. Is there a pattern for how to do this?
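For reference, the kind of hack I mean would look something like this (the EXISTS predicate is always true whenever the check view is non-empty, so its only job is to force the dependency edge; dimension join omitted for brevity):

-- Forcing the graph edge with an always-true predicate (feels weird)
CREATE OR REFRESH MATERIALIZED VIEW gold_data AS
SELECT *
FROM silver_data
-- silver_data_pk_check is non-empty whenever silver_data is,
-- so this never filters rows; it only adds the dependency.
WHERE EXISTS (SELECT 1 FROM silver_data_pk_check);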

4 REPLIES

Poorva21
New Contributor

Yes, there is a clean, officially recommended pattern in LDP/Lakeflow, and it avoids all the "feels weird" hacks like CROSS JOIN, WHERE EXISTS, or artificial filters.

The correct pattern is:

Use a PRIVATE MV for the validation step + reference it (lightly) in the downstream MV

This creates a gating dependency without changing query semantics.

It is the Lakeflow equivalent of dbt tests gating downstream models.
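As a sketch, using your silver_data_pk_check from step 3 (adapt the column names to your real schema):

-- Gate gold_data on the PK check without changing its results
CREATE OR REFRESH MATERIALIZED VIEW gold_data AS
SELECT s.*
FROM silver_data s  -- your LEFT JOIN dimension_table goes here
-- Every key in silver_data appears exactly once in the check view
-- (it is grouped by primary_key), so this predicate never filters
-- rows; it only creates the graph edge to silver_data_pk_check.
-- Because that view carries ON VIOLATION FAIL UPDATE, a duplicate
-- key fails the update before gold_data refreshes.
WHERE s.primary_key IN (SELECT primary_key FROM silver_data_pk_check);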

absan
New Contributor III

Thank you for your reply. Is this officially recommended pattern documented somewhere?

Hubert-Dudek
Esteemed Contributor III

I know your solution is quite popular (I just don't get the SELECT MAX(load_date) part). Another option is to use AUTO CDC even if you don't have CDC data, since it has a KEYS option. If MAX(load_date) means that the latest snapshot is what matters most to you, please check AUTO CDC FROM SNAPSHOT.
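A rough sketch of what that could look like (the KEYS and SEQUENCE BY columns are placeholders for whatever fits your data; the FROM SNAPSHOT variant follows the same shape):

-- Target table for the upserts
CREATE OR REFRESH STREAMING TABLE silver_data;

-- Keep the latest row per primary_key (SCD type 1 upsert),
-- ordered by load_date
CREATE FLOW silver_cdc AS AUTO CDC INTO silver_data
FROM STREAM bronze_data
KEYS (primary_key)
SEQUENCE BY load_date
STORED AS SCD TYPE 1;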

absan
New Contributor III

Thank you for the suggestion @Hubert-Dudek. I explored both of those options; however, my data does not have a consistent key column that would allow me to use AUTO CDC.

Instead, I'm grabbing the entire batch that was last loaded and then doing a replace downstream using:

INSERT INTO TABLE gold
REPLACE USING loadkey
SELECT * FROM silver

Unfortunately, this last step is not supported in LDP, so I have to run it as a separate task in the job. But it is very simple and works well.