7 hours ago
I've created a Lakeflow job to run 5 notebook tasks, one for each silver table: Customers, Accounts, Transactions, Loans, and Branches.
In the Customers notebook, after writing the data to a Delta table using Auto Loader, I'm applying the NOT NULL and PRIMARY KEY constraints on customer_id:
3 hours ago
Capturing columns as a primary key helps users and tools understand relationships in the data. You can also create a primary key with RELY, which lets the optimizer skip redundant operations in some cases.
Distinct Elimination
When you apply a DISTINCT operator to a column marked as a PRIMARY KEY with RELY, the optimizer knows every value is already unique, so it skips the expensive shuffle and sort that deduplication would otherwise require.
SELECT DISTINCT p_product_id FROM products will be treated as SELECT p_product_id FROM products, removing that overhead.
Join Elimination
In a LEFT OUTER JOIN where you only select columns from the Fact table, the optimizer can eliminate the join if it knows the Dimension table is joined on a unique primary key.
SELECT SUM(oi_sales_price)
FROM order_items oi
LEFT JOIN promotions p ON oi.promo_id = p.p_promo_id;
If p_promo_id is a RELY primary key, the optimizer knows the join cannot change the SUM result and executes only the scan of the fact table: SELECT SUM(oi_sales_price) FROM order_items oi;
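Both optimizations depend on the constraint being declared with RELY. A sketch of what that declaration might look like, assuming a products table whose p_product_id values your pipeline already guarantees are unique and non-null (table and column names are illustrative):

```sql
-- Primary key columns must be NOT NULL before the constraint can be added
ALTER TABLE products ALTER COLUMN p_product_id SET NOT NULL;

-- RELY asserts that the pipeline guarantees uniqueness;
-- Databricks does not verify it, the optimizer simply trusts it
ALTER TABLE products ADD CONSTRAINT pk_products PRIMARY KEY (p_product_id) RELY;
```

Only add RELY if your pipeline truly enforces uniqueness -- a RELY constraint on duplicated data can produce wrong query results.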
6 hours ago
@AanchalSoni In Databricks, primary/foreign key constraints are informational only -- Delta tables are not relational entities like tables in a traditional RDBMS, so these constraints are not enforced. You can apply NOT NULL constraints or set expectations in Lakeflow to control execution and data quality.
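For example, a Lakeflow declarative pipeline can enforce data quality with an expectation. A sketch, with assumed table and source names:

```sql
CREATE OR REFRESH STREAMING TABLE customers_silver(
  -- Drop rows with a missing customer_id;
  -- use ON VIOLATION FAIL UPDATE to halt the pipeline instead
  CONSTRAINT valid_customer_id EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM read_files("/Volumes/main/raw/customers/");
```

Unlike an informational PRIMARY KEY, expectations are actually evaluated on every row during the pipeline run.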
4 hours ago
Hi @AanchalSoni
Primary and foreign keys in Unity Catalog are informational only. Databricks does not enforce uniqueness during writes, to avoid the massive performance overhead. You are responsible for maintaining data integrity in the pipeline logic; the table definition is just a blueprint for documentation and optimization.
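For instance, an informational primary key can be added to an existing silver table like this (a sketch with assumed table and column names; the column must be made NOT NULL first):

```sql
-- The primary key column cannot be nullable
ALTER TABLE customers ALTER COLUMN customer_id SET NOT NULL;

-- Informational only: documents the key, but uniqueness is never checked
ALTER TABLE customers ADD CONSTRAINT customers_pk PRIMARY KEY (customer_id);
```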
3 hours ago
Thanks @Sumit_7 @balajij8 for the clarification. I read the same about primary keys and foreign keys, but got confused when I saw the table Overview: it distinctly highlights the PK on customer_id (please check the screenshot). Also, I was getting errors for the other tables when they ran before the Customers table: 'primary key doesn't exist in customers table.'
What does informational mean?
3 hours ago
Hi,
If you are using declarative pipelines, Databricks respects the definition you gave in the table declaration. If you subsequently run an ALTER TABLE statement in the same pipeline, it won't be preserved.
What you should do is either set the primary key in the declarative pipeline's table definition or run the ALTER TABLE as an additional task in the job. Although Databricks doesn't enforce primary key constraints, it should still be aware of them.
CREATE OR REFRESH STREAMING TABLE orders_valid(
  -- Informational only; note that primary key columns must also be NOT NULL
  CONSTRAINT pk_customer_id PRIMARY KEY (customer_id)
)
AS SELECT * FROM STREAM read_files("/databricks-datasets/retail-org/sales_orders");
Thanks,
Emma
3 hours ago
Thanks @emma_s ! This helps!
3 hours ago
@balajij8 thanks again!
The explanation totally makes sense now.