Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Primary key constraint not working

AanchalSoni
Databricks Partner

I've created a Lakeflow job that runs 5 notebook tasks, one for each silver table: Customers, Accounts, Transactions, Loans, and Branches.

In the Customers notebook, after writing the data to the Delta table using Auto Loader, I apply the NOT NULL and primary key constraints on customer_id:

 
spark.sql("ALTER TABLE capstone_project.silver.customers_silver ALTER COLUMN customer_id SET NOT NULL")
spark.sql("ALTER TABLE capstone_project.silver.customers_silver ADD CONSTRAINT pk_customer_id PRIMARY KEY (customer_id)")
 
The other tables have a foreign key constraint on customer_id in a similar fashion, and their notebooks depend on the Customers notebook.
 
After the Customers notebook completes successfully, the other notebooks throw an error that there is no primary key constraint on the customers table. And indeed, when I check the customers table, no primary key exists.
 
But the ALTER TABLE code executed successfully and returned a DataFrame. So where did the primary key go? I'm unable to identify the root cause. Please assist.
1 ACCEPTED SOLUTION


balajij8
Contributor

@AanchalSoni 

Capturing columns as a primary key helps users and tools understand relationships in the data. If you create the primary key with RELY, the optimizer can also use it to skip redundant operations in some cases:

  • Distinct Elimination
    When you apply a DISTINCT operator to a column marked as a PRIMARY KEY with RELY, the optimizer knows every value is already unique, so it skips the expensive shuffle and sort otherwise required to deduplicate the values.

    SELECT DISTINCT p_product_id FROM products will be treated as SELECT p_product_id FROM products, removing the overhead.

  • Join Elimination
    In a LEFT OUTER JOIN where you only select columns from the Fact table, the optimizer can eliminate the join if it knows the Dimension table is joined on a unique primary key.

    SELECT SUM(oi_sales_price) 
    FROM order_items oi
    LEFT JOIN promotions p ON oi.promo_id = p.p_promo_id;
    If p_promo_id is a RELY primary key, the optimizer realizes the join cannot change the SUM result, so it executes only the scan of the fact table: SELECT SUM(oi_sales_price) FROM order_items oi;
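For reference, a minimal sketch of declaring such a key (column names taken from the examples above; RELY tells the optimizer it may trust the constraint even though Databricks does not enforce it):

```sql
-- The column must be non-nullable before it can be a primary key
ALTER TABLE products ALTER COLUMN p_product_id SET NOT NULL;

-- Informational primary key; RELY lets the optimizer assume uniqueness
-- for rewrites such as DISTINCT elimination and join elimination
ALTER TABLE products ADD CONSTRAINT pk_product_id PRIMARY KEY (p_product_id) RELY;
```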

     


7 REPLIES

Sumit_7
Honored Contributor II

@AanchalSoni  In Databricks, primary and foreign key constraints are informational only: Delta tables are not relational entities like tables in a traditional RDBMS, so these constraints are not enforced. You can apply NOT NULL constraints or set expectations in Lakeflow to control execution and data quality.
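For example, a NOT NULL expectation in a Lakeflow declarative pipeline can be sketched as below (the table name and dataset path are illustrative; ON VIOLATION DROP ROW discards rows that fail the check):

```sql
CREATE OR REFRESH STREAMING TABLE customers_silver (
  CONSTRAINT customer_id_not_null EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM read_files("/databricks-datasets/retail-org/customers");
```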

balajij8
Contributor

Hi @AanchalSoni 

Primary and foreign keys in Unity Catalog are informational only. Databricks does not enforce uniqueness during writes, to avoid the massive performance overhead; you are responsible for maintaining data integrity in your pipeline logic. The table definition is just a blueprint for documentation and optimization. You can use the following:

  • MERGE: Instead of INSERT, use MERGE to upsert records; it is the standard way to prevent duplicates in Delta Lake.
  • Deduplication: Apply ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) = 1 in your code before writing to the tables.
  • Check constraints: While primary and foreign keys are not enforced, NOT NULL and CHECK constraints are strictly enforced.
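A sketch combining the first two points (the source view customers_updates and the updated_at column are illustrative assumptions, not from the original post):

```sql
-- Deduplicate the incoming batch, keeping only the latest row per key,
-- then upsert so reruns cannot create duplicate customer_ids
MERGE INTO capstone_project.silver.customers_silver AS t
USING (
  SELECT *
  FROM customers_updates
  QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) = 1
) AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```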

AanchalSoni
Databricks Partner

Thanks @Sumit_7 @balajij8 for the clarification. I had read the same about primary and foreign keys, but got confused when I saw the table Overview: it distinctly highlights a PK on customer_id. Please check the screenshot. Also, when the other tables ran before the customers table, I was getting errors: 'primary key doesn't exist in customers table.'

What does informational mean?

emma_s
Databricks Employee

Hi,

If you are using declarative pipelines, then Databricks respects the definition that you give in the table declaration. If you subsequently run an ALTER TABLE statement in the same pipeline, the change won't be preserved, because the pipeline re-applies the declared table definition on each update.

What you should do is either set the primary key in the declarative pipeline's table definition or run the ALTER TABLE as an additional task in the job. Although Databricks doesn't enforce primary key constraints, it will still be aware of them.

CREATE OR REFRESH STREAMING TABLE orders_valid(
  CONSTRAINT pk_customer_id PRIMARY KEY (customer_id)
)
AS SELECT * FROM STREAM read_files("/databricks-datasets/retail-org/sales_orders");
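The second option, running the constraint DDL as a separate job task after the pipeline, can be sketched as follows (table and constraint names taken from the original post; the RELY keyword is optional and only affects optimization):

```sql
-- Run as its own SQL task, downstream of the pipeline task, so the
-- pipeline's managed table definition is not overwritten mid-update
ALTER TABLE capstone_project.silver.customers_silver
  ALTER COLUMN customer_id SET NOT NULL;

ALTER TABLE capstone_project.silver.customers_silver
  ADD CONSTRAINT pk_customer_id PRIMARY KEY (customer_id) RELY;
```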

Thanks,

Emma

AanchalSoni
Databricks Partner

Thanks @emma_s ! This helps!


@balajij8 thanks again!

The explanation totally makes sense now.