Databricks

Nick_Hughes · ‎05-16-2023

Hi

We are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold layer fact table form 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships between the tables so that when we run actual versus expected tests, the results match as without this referential integrity being adhered to, the results never will reconcile. We want records richer than the datagen tool which seemingly relies a lot on generating 0s and 1s ( https://github.com/databrickslabs/dbldatagen ). Has anyone had any success in implementing such a solution? Obviously with the lack of primary and foreign keys in DDLs it's hard to generate data like you would in an RDBMS. We're also looking at generation methods such as hackolade and Idera and other tools that allow you to generate it manually from a model but we want this to be generated as part of a CICD pipeline.

User16502773013 · ‎06-07-2023

Hello Nick,

I think the best approach in this case is to build the main tables that has primary keys using data generators dbldatagen or other data generators then build the tables that need refrerential integrity from these base table

At a high level you can create fake table/data

as: Table_A: PK_Col, Col1, Col2, col3

Table_B: : PK_Col, Col1, Col2, col3

.......

Then can create a derived table Table_C that is correlated to table_A by limiting its foreign key to Table_A PK (PK_Col in our example)

This way we will have correlated data that has referential integrity included

Regards

Anonymous · ‎06-21-2023

Hi @Nick Hughes

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!

RonanStokes_DB · ‎07-13-2023

Hi @Nick_Hughes
This may be late for your scenario - but hopefully others facing similar issues will find it useful.

You can specify how data is generated in `dbldatagen` using rules in the data generation spec. If rules are specified for data generation, you can generate a wide variety of data.

The document (located at https://databrickslabs.github.io/dbldatagen/public_docs/index.html) gives many examples.

There is an example similar to what you are requesting (multi tables with cross referencing) located at: https://databrickslabs.github.io/dbldatagen/public_docs/multi_table_data.html

If you are seeing only 1s and 0s when rules are specified, or simply have suggestions on ways to improve this - feel free to open an issue in the issues page of the main GitHub site at https://github.com/databrickslabs/dbldatagen

Regards Ronan Stokes (Databricks)

Databricks

Best way to generate fake data using underlying schema

Unity Catalog Lakeguard: Industry-first and only data governance for multi-user Apache™ Spark cluste

Announcing the General Availability of Databricks Asset Bundles

Register now and save 50% on training at Data + AI Summit!

How to successfully build GenAI applications

Meet DBRX, the New Standard for High-Quality LLMs