How do I get a cartesian product of a huge dataset?

rlgarris
Databricks Employee

A cartesian product (also called a cross product or cross join) is a common operation that pairs every row of one table with every row of another.

For example, say you have a list of customers and a product catalog, and you want every possible customer-product combination.
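To make the idea concrete, here is a minimal sketch in plain Python (not Spark) using the standard library's `itertools.product`. The customer and product names are made up for illustration.

```python
from itertools import product

# Hypothetical sample data, for illustration only
customers = ["alice", "bob", "carol"]
products = ["widget", "gadget"]

# Pair every customer with every product: 3 x 2 = 6 combinations
pairs = list(product(customers, products))

for customer, item in pairs:
    print(customer, item)
```

The number of pairs is always the product of the two input sizes, which is why the output grows so quickly on real tables.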

Cartesian products, however, can be very expensive. Even with as few as 6,000 products and 100,000 customers, the output will be 600 million records (6K x 100K = 600M).
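The arithmetic is worth checking before you run such a job. The sketch below estimates the output row count from the figures above. The 100 bytes per row is purely an assumed figure for illustration, not a measurement, and `cross_product_rows` is a hypothetical helper name.

```python
def cross_product_rows(left_rows: int, right_rows: int) -> int:
    """Number of rows produced by a cartesian product of two tables."""
    return left_rows * right_rows

# Figures from the example above: 6,000 products and 100,000 customers
rows = cross_product_rows(6_000, 100_000)
print(f"{rows:,} rows")

# ASSUMPTION: 100 bytes per output row, chosen only to illustrate scale
ASSUMED_BYTES_PER_ROW = 100
approx_gb = rows * ASSUMED_BYTES_PER_ROW / 1e9
print(f"roughly {approx_gb:.0f} GB at {ASSUMED_BYTES_PER_ROW} bytes per row")
```

Under that assumed row size, the result would be on the order of tens of gigabytes, so it helps to estimate the output size before materializing it.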