Re: Best way to generate fake data using underlyin...

savlahanish27 · ‎06-22-2026

The core problem you're facing is that Delta Lake doesn't enforce foreign key constraints, and most datagen tools generate each table independently. The key values in one table then have no relationship to the key values in another, so your joins produce little or no meaningful overlap.

The solution is to generate a shared key pool first (a simple list of IDs for each dimension entity, such as products, stores, and customers) and then have every table draw its foreign key columns from that same pool. This guarantees that when your pipeline joins the tables, the keys actually match.
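
Here is a minimal sketch of that idea in plain Python, just to show the shape of it. The table names, column names, ID format, and pool sizes (`make_key_pool`, `generate_sales`, `P_00001`, and so on) are all made up for illustration and are not taken from your schema, so adjust them to whatever your pipeline expects.

```python
import random


def make_key_pool(prefix, n):
    """Build a fixed list of IDs for one dimension entity (hypothetical format)."""
    return [f"{prefix}_{i:05d}" for i in range(1, n + 1)]


def generate_sales(n_rows, product_ids, store_ids, customer_ids, seed=42):
    """Generate fact rows whose foreign keys are drawn only from the shared pools."""
    rng = random.Random(seed)  # seeded so the fake data is reproducible
    return [
        {
            "sale_id": i,
            "product_id": rng.choice(product_ids),
            "store_id": rng.choice(store_ids),
            "customer_id": rng.choice(customer_ids),
            "quantity": rng.randint(1, 5),
        }
        for i in range(1, n_rows + 1)
    ]


# 1. Create the shared key pools first (sizes are arbitrary assumptions).
product_ids = make_key_pool("P", 50)
store_ids = make_key_pool("S", 10)
customer_ids = make_key_pool("C", 200)

# 2. Build the dimension tables from the pools, one row per key.
products = [{"product_id": pid, "name": f"Product {pid}"} for pid in product_ids]
stores = [{"store_id": sid, "name": f"Store {sid}"} for sid in store_ids]
customers = [{"customer_id": cid, "name": f"Customer {cid}"} for cid in customer_ids]

# 3. Build the fact table, drawing every foreign key from the same pools.
sales = generate_sales(1000, product_ids, store_ids, customer_ids)

# 4. Sanity check: collect any fact rows whose product key has no dimension row.
product_keys = {p["product_id"] for p in products}
orphans = [s for s in sales if s["product_id"] not in product_keys]
```

Because every foreign key comes from a pool that also produced the dimension rows, `orphans` should come back empty. On Databricks you could pass these lists of dicts to `spark.createDataFrame` and write them out as Delta tables, though for large volumes you would probably want to generate the rows in Spark directly rather than on the driver.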