
Metadata-driven DQ validation for multiple tables dynamically

subha2
New Contributor II

There are multiple tables registered in the config/metadata table. These tables need to be validated against the following DQ rules:

1. Natural key / business key / primary key cannot be null or blank.
2. Natural key / primary key cannot be duplicated.
3. Join columns cannot have missing values.
4. Business-specific rules.

How do we validate the above rules dynamically for the tables configured in the metadata-driven table? Please suggest an approach using PySpark.

How should the data validation rule engine be built using PySpark?

Example: there are 2 tables, Employee and Address. Some generic function will be written which verifies that EmployeeId is not null/blank for the Employee table and that AddressLine1 is not null/blank for the Address table, along the lines of the sketch below.
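Roughly, the intent is a single generic, metadata-driven check along these lines (a minimal sketch only; the function name check_not_null_or_blank and the column lists are placeholders, and it assumes both tables are registered in the metastore):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def check_not_null_or_blank(table_name, columns):
    """Generic check: return rows where any of the given columns is null or blank."""
    df = spark.table(table_name)
    cond = F.lit(False)
    for c in columns:
        cond = cond | F.col(c).isNull() | (F.trim(F.col(c)) == "")
    return df.filter(cond)

# The same function serves both tables, driven only by configuration:
check_not_null_or_blank("Employee", ["EmployeeId"]).show()
check_not_null_or_blank("Address", ["AddressLine1"]).show()
```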

2 REPLIES

Kaniz
Community Manager

Hi @subha2, to dynamically validate the data quality (DQ) rules for tables configured in a metadata-driven system using PySpark, you can follow these steps:

  1. Define Metadata for Tables:

    • First, create a metadata configuration that describes the rules for each table. This metadata should include information about natural keys, primary keys, join columns, and any other business-specific rules.
  2. Load Table Data:

    • Load the data for each table into a PySpark DataFrame.
  3. Implement DQ Validation Rules:

    • Write a generic function that takes the table name and the DataFrame as input.
    • Inside this function, apply the following validation rules:
      • Check if natural keys, business keys, and primary keys are not null or blank.
      • Ensure that natural keys/primary keys are not duplicated.
      • Validate join columns for missing values.
  4. Customize the Function:

    • Extend the validate_dq_rules function to handle additional rules based on your metadata configuration; a sketch of steps 1–4 follows below.
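Here is a minimal sketch of what steps 1 and 3 could look like, assuming the per-table rules are kept in a simple list-of-dicts metadata structure and that the key/join column names shown are placeholders rather than a fixed schema:

```python
from pyspark.sql import DataFrame, functions as F

# Step 1: metadata configuration describing the rules for each table
# (the structure and column names here are assumptions for illustration).
dq_metadata = [
    {"table_name": "Employee", "key_columns": ["EmployeeId"],   "join_columns": ["DeptId"]},
    {"table_name": "Address",  "key_columns": ["AddressLine1"], "join_columns": ["EmployeeId"]},
]

def validate_dq_rules(table_name: str, df: DataFrame,
                      key_columns: list, join_columns: list) -> DataFrame:
    """Step 3: return a DataFrame of rule violations for one table."""
    results = []

    # Rule 1: key columns must not be null or blank.
    for col in key_columns:
        bad = df.filter(F.col(col).isNull() | (F.trim(F.col(col)) == ""))
        results.append(
            bad.select(F.lit(table_name).alias("table_name"),
                       F.lit(f"{col} is null/blank").alias("rule_violated"))
        )

    # Rule 2: the key column combination must not be duplicated.
    dups = df.groupBy(*key_columns).count().filter(F.col("count") > 1)
    results.append(
        dups.select(F.lit(table_name).alias("table_name"),
                    F.lit("duplicate key").alias("rule_violated"))
    )

    # Rule 3: join columns must not have missing values.
    for col in join_columns:
        missing = df.filter(F.col(col).isNull())
        results.append(
            missing.select(F.lit(table_name).alias("table_name"),
                           F.lit(f"{col} missing for join").alias("rule_violated"))
        )

    # Union all violations into a single result DataFrame.
    out = results[0]
    for r in results[1:]:
        out = out.unionByName(r)
    return out
```

Business-specific rules (step 4) could be added the same way, for example by storing a filter expression per table in the metadata and applying it with F.expr inside the function.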

This approach allows you to dynamically validate DQ rules for any configured table using PySpark. 😊
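For example, a driver loop over the configured tables might look like this (again assuming the dq_metadata structure sketched above and that the tables are registered in the metastore so spark.table can load them by name):

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 2: load each configured table, then apply the generic checks (steps 3-4).
violation_dfs = []
for entry in dq_metadata:
    df = spark.table(entry["table_name"])
    violation_dfs.append(
        validate_dq_rules(entry["table_name"], df,
                          entry["key_columns"], entry["join_columns"])
    )

# Combine the per-table results into a single DQ report.
dq_report = reduce(lambda a, b: a.unionByName(b), violation_dfs)
dq_report.show(truncate=False)
```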

subha2
New Contributor II

Could you share some sample examples?

What should the structure of the rule engine table be?

How do we create a function/method that dynamically takes a column or a combination of columns, passes them to the function, and returns a DataFrame?