cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Best way to generate fake data using underlying schema

Nick_Hughes
New Contributor III

Hi

We are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold layer fact table form 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships between the tables so that when we run actual versus expected tests, the results match as without this referential integrity being adhered to, the results never will reconcile. We want records richer than the datagen tool which seemingly relies a lot on generating 0s and 1s ( https://github.com/databrickslabs/dbldatagen ). Has anyone had any success in implementing such a solution? Obviously with the lack of primary and foreign keys in DDLs it's hard to generate data like you would in an RDBMS. We're also looking at generation methods such as hackolade and Idera and other tools that allow you to generate it manually from a model but we want this to be generated as part of a CICD pipeline.

3 REPLIES 3

User16502773013
Databricks Employee
Databricks Employee

Hello Nick,

I think the best approach in this case is to build the main tables that has primary keys using data generators dbldatagen or other data generators then build the tables that need refrerential integrity from these base table

At a high level you can create fake table/data

as: Table_A: PK_Col, Col1, Col2, col3

Table_B: : PK_Col, Col1, Col2, col3

.......

Then can create a derived table Table_C that is correlated to table_A by limiting its foreign key to Table_A PK (PK_Col in our example)

This way we will have correlated data that has referential integrity included

Regards

Anonymous
Not applicable

Hi @Nick Hughes​ 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

RonanStokes_DB
Databricks Employee
Databricks Employee

Hi @Nick_Hughes 
This may be late for your scenario - but hopefully others facing similar issues will find it useful.

You can specify how data is generated in `dbldatagen` using rules in the data generation spec. If rules are specified for data generation, you can generate a wide variety of data. 

The document (located at https://databrickslabs.github.io/dbldatagen/public_docs/index.html) gives many examples. 

There is an example similar to what you are requesting (multi tables with cross referencing) located at: https://databrickslabs.github.io/dbldatagen/public_docs/multi_table_data.html

If you are seeing only 1s and 0s when rules are specified, or simply have suggestions on ways to improve this - feel free to open an issue in the issues page of the main GitHub site at https://github.com/databrickslabs/dbldatagen

Regards Ronan Stokes (Databricks)

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group