Data Engineering

Data quality check: what is the best method to validate two Parquet files?

rameshybr
New Contributor II

Data quality check: we have to validate the data between the landing data and the bronze data. Below are the data quality checks.

1. Compare the row counts of the two files. If they match, go to point 2.

2. If the counts match, validate the data row by row using the keys. If the keys match, compare the remaining columns. If any columns do not match, store the rows in an error log file.

What is the best approach for this in PySpark (Databricks)?

4 REPLIES

-werners-
Esteemed Contributor III

What you are looking for is except and exceptAll. In PySpark, except is exposed as subtract, because except is a reserved word in Python.
For example, df1.exceptAll(df2)
returns the rows of df1 that have no match in df2.

rameshybr
New Contributor II

Thanks, Werners. Will it perform well?

-werners-
Esteemed Contributor III

It runs on Spark, so the work is distributed. But of course it is an expensive operation, as all records are compared.
In my experience the performance is reasonable.

Rishabh-Pandey
Esteemed Contributor

Try this. It covers the second point, assuming the counts from the first point already match.

# Define key columns
key_columns = ["key_column1", "key_column2"]  # Adjust according to your data schema

# Perform an outer join to find mismatches
joined_df = landing_df.alias("landing").join(
    bronze_df.alias("bronze"),
    on=key_columns,
    how="outer"
)

 

Rishabh Pandey
