topic Re: DQ-Quality Check- what are the best method to validate the two parquet files . in Data Engineering

DQ-Quality Check- what are the best method to validate the two parquet files .

rameshybr — Fri, 16 Aug 2024 05:56:23 GMT

DQ quality check: we have to validate the data between the landing layer and the bronze layer. Below are the data quality checks.

1. Compare the row counts of the two files. If they match, go on to point 2.

2. If the counts match, validate the data row by row using the keys. If the keys match, compare the remaining columns. If any column does not match, store the row in an error log file.

What is the best approach for this in PySpark (Databricks)?

Re: DQ-Quality Check- what are the best method to validate the two parquet files .

-werners- — Wed, 21 Aug 2024 13:47:50 GMT

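What you are looking for is except and exceptAll.
For example, df1.exceptAll(df2). Note that df1.except(df2) is the Scala form; in PySpark the distinct variant is df1.subtract(df2), because except is a reserved word in Python.
It returns the rows of df1 that have no match in df2.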

Re: DQ-Quality Check- what are the best method to validate the two parquet files .

rameshybr — Wed, 21 Aug 2024 15:18:48 GMT

Thanks, Werners. Will it perform well?

Re: DQ-Quality Check- what are the best method to validate the two parquet files .

-werners- — Thu, 22 Aug 2024 07:12:37 GMT

It does use Spark, but of course it is an expensive operation because all records are compared.
In my experience the performance is reasonable.

Re: DQ-Quality Check- what are the best method to validate the two parquet files .

Rishabh-Pandey — Thu, 22 Aug 2024 07:34:00 GMT

Try this. It covers the second point, assuming the first check (the counts) already passes.

# Define key columns
key_columns = ["key_column1", "key_column2"]  # Adjust according to your data schema

# Perform an outer join to find mismatches
joined_df = landing_df.alias("landing").join(
    bronze_df.alias("bronze"),
    on=key_columns,
    how="outer"
)