Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Primary key constraint not working

AanchalSoni
Databricks Partner

I've created a Lakeflow job that runs 5 notebook tasks, one for each silver table: Customers, Accounts, Transactions, Loans, and Branches.

In the Customers notebook, after writing the data to the Delta table using Auto Loader, I apply the NOT NULL and primary key constraints on customer_id:

 
spark.sql("ALTER TABLE capstone_project.silver.customers_silver ALTER COLUMN customer_id SET NOT NULL")
spark.sql("ALTER TABLE capstone_project.silver.customers_silver ADD CONSTRAINT pk_customer_id PRIMARY KEY (customer_id)")
 
The other tables have a foreign key constraint on customer_id in a similar fashion, and their notebooks depend on the Customers notebook.
 
After the Customers notebook completes successfully, the other notebooks throw an error that there is no primary key constraint on the customers table. And indeed, when I check the customers table, no primary key exists.
 
But the ALTER TABLE code executed successfully and returned a DataFrame. So where did the primary key go? I'm unable to identify the root cause. Please assist.
1 ACCEPTED SOLUTION


balajij8
Contributor

@AanchalSoni 

Capturing columns as a primary key helps users and tools understand relationships in the data. If you create the primary key with RELY, the optimizer can also use it to skip redundant operations in some cases:

  • Distinct Elimination
    When you apply a DISTINCT operator to a column marked as a PRIMARY KEY with RELY, the optimizer knows every value is already unique, so it skips the expensive shuffle and sort otherwise required to deduplicate the values.

    SELECT DISTINCT p_product_id FROM products will be treated as SELECT p_product_id FROM products, removing the overhead.

  • Join Elimination
    In a LEFT OUTER JOIN where you only select columns from the Fact table, the optimizer can eliminate the join if it knows the Dimension table is joined on a unique primary key.

    SELECT SUM(oi_sales_price) 
    FROM order_items oi
    LEFT JOIN promotions p ON oi.promo_id = p.p_promo_id;
    If p_promo_id is a RELY primary key, the optimizer realizes the join cannot change the SUM result, so it executes only the scan of the fact table: SELECT SUM(oi_sales_price) FROM order_items oi;
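For reference, a minimal sketch of declaring such a key (column names taken from the examples above; RELY tells the optimizer it may trust the constraint even though Databricks does not enforce it):

```sql
-- The column must be non-nullable before it can be a primary key
ALTER TABLE products ALTER COLUMN p_product_id SET NOT NULL;

-- Informational primary key; RELY lets the optimizer assume uniqueness
-- for rewrites such as DISTINCT elimination and join elimination
ALTER TABLE products ADD CONSTRAINT pk_product_id PRIMARY KEY (p_product_id) RELY;
```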

     


7 REPLIES

Sumit_7
Honored Contributor II

@AanchalSoni  In Databricks, primary and foreign key constraints are informational only: Delta tables are not relational entities like tables in a traditional RDBMS, so these constraints are not enforced. You can apply NOT NULL constraints or set expectations in Lakeflow to control execution and data quality.
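For example, a NOT NULL expectation in a Lakeflow declarative pipeline can be sketched as below (the table name and dataset path are illustrative; ON VIOLATION DROP ROW discards rows that fail the check):

```sql
CREATE OR REFRESH STREAMING TABLE customers_silver (
  CONSTRAINT customer_id_not_null EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM read_files("/databricks-datasets/retail-org/customers");
```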

balajij8
Contributor

Hi @AanchalSoni 

Primary and foreign keys in Unity Catalog are informational only. Databricks does not enforce uniqueness during writes, to avoid the massive performance overhead; you are responsible for maintaining data integrity in your pipeline logic. The table definition is just a blueprint for documentation and optimization. You can use the following:

  • MERGE: Instead of INSERT, use MERGE to upsert records; it is the standard way to prevent duplicates in Delta Lake.
  • Deduplication: Apply ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) = 1 in your code before writing to the tables.
  • Check constraints: While primary and foreign keys are not enforced, NOT NULL and CHECK constraints are strictly enforced.
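A sketch combining the first two points (the source view customers_updates and the updated_at column are illustrative assumptions, not from the original post):

```sql
-- Deduplicate the incoming batch, keeping only the latest row per key,
-- then upsert so reruns cannot create duplicate customer_ids
MERGE INTO capstone_project.silver.customers_silver AS t
USING (
  SELECT *
  FROM customers_updates
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```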

AanchalSoni
Databricks Partner

Thanks @Sumit_7 @balajij8 for the clarification. I had read the same about primary and foreign keys, but got confused when I saw the table Overview: it distinctly highlights a PK on customer_id. Please check the screenshot. Also, when the other tables ran before the customers table, I was getting errors: 'primary key doesn't exist in customers table.'

What does informational mean?

emma_s
Databricks Employee

Hi,

If you are using declarative pipelines, then Databricks respects the definition that you give in the table declaration. If you subsequently run an ALTER TABLE statement in the same pipeline, the change won't be preserved, because the pipeline re-applies the declared table definition on each update.

What you should do is either set the primary key in the declarative pipeline's table definition or run the ALTER TABLE as an additional task in the job. Although Databricks doesn't enforce primary key constraints, it will still be aware of them.

CREATE OR REFRESH STREAMING TABLE orders_valid(
  CONSTRAINT pk_customer_id PRIMARY KEY (customer_id)
)
AS SELECT * FROM STREAM read_files("/databricks-datasets/retail-org/sales_orders");
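The second option, running the constraint DDL as a separate job task after the pipeline, can be sketched as follows (table and constraint names taken from the original post; the RELY keyword is optional and only affects optimization):

```sql
-- Run as its own SQL task, downstream of the pipeline task, so the
-- pipeline's managed table definition is not overwritten mid-update
ALTER TABLE capstone_project.silver.customers_silver
  ALTER COLUMN customer_id SET NOT NULL;

ALTER TABLE capstone_project.silver.customers_silver
  ADD CONSTRAINT pk_customer_id PRIMARY KEY (customer_id) RELY;
```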

Thanks,

Emma

AanchalSoni
Databricks Partner

Thanks @emma_s ! This helps!


@balajij8 thanks again!

The explanation totally makes sense now.