Best way to generate fake data using underlying schema
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-16-2023 03:43 AM
Hi
We are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold layer fact table form 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships between the tables so that when we run actual versus expected tests, the results match as without this referential integrity being adhered to, the results never will reconcile. We want records richer than the datagen tool which seemingly relies a lot on generating 0s and 1s ( https://github.com/databrickslabs/dbldatagen ). Has anyone had any success in implementing such a solution? Obviously with the lack of primary and foreign keys in DDLs it's hard to generate data like you would in an RDBMS. We're also looking at generation methods such as hackolade and Idera and other tools that allow you to generate it manually from a model but we want this to be generated as part of a CICD pipeline.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-07-2023 07:39 PM
Hello Nick,
I think the best approach in this case is to build the main tables that has primary keys using data generators dbldatagen or other data generators then build the tables that need refrerential integrity from these base table
At a high level you can create fake table/data
as: Table_A: PK_Col, Col1, Col2, col3
Table_B: : PK_Col, Col1, Col2, col3
.......
Then can create a derived table Table_C that is correlated to table_A by limiting its foreign key to Table_A PK (PK_Col in our example)
This way we will have correlated data that has referential integrity included
Regards

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-21-2023 12:06 AM
Hi @Nick Hughes
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-13-2023 11:14 AM
Hi @Nick_Hughes
This may be late for your scenario - but hopefully others facing similar issues will find it useful.
You can specify how data is generated in `dbldatagen` using rules in the data generation spec. If rules are specified for data generation, you can generate a wide variety of data.
The document (located at https://databrickslabs.github.io/dbldatagen/public_docs/index.html) gives many examples.
There is an example similar to what you are requesting (multi tables with cross referencing) located at: https://databrickslabs.github.io/dbldatagen/public_docs/multi_table_data.html
If you are seeing only 1s and 0s when rules are specified, or simply have suggestions on ways to improve this - feel free to open an issue in the issues page of the main GitHub site at https://github.com/databrickslabs/dbldatagen
Regards Ronan Stokes (Databricks)

