Re: Using Great Expectations with Auto Loader

Anonymous · 03-17-2023

@Chhaya Vishwakarma :

One alternative is to run the data quality checks inside the streaming pipeline itself, using Spark's built-in column functions. Predicates such as isNotNull() or isin() can filter out bad rows before the data is written to the next layer. For example, you could add something like the following to your bronze-to-silver transformation (please verify and adapt the code to your use case):

from pyspark.sql.functions import col, to_date

# Parse the date columns, then keep only rows that pass the quality checks
silver_df = bronze_df.withColumn("date1", to_date(col("date1"), "yyyyDDD"))\
                     .withColumn("date2", to_date(col("date2"), "yyyyDDD"))\
                     .withColumn("date3", to_date(col("date3"), "MMddyy"))\
                     .filter(col("col1").isNotNull())\
                     .filter(col("col2").isin([1, 6]))

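This keeps only the rows where col1 is not null and col2 is either 1 or 6. Rows where col2 is null are dropped as well, because isin() evaluates to null for them and filter() keeps only rows where the condition is true. You can add further quality checks in the same way.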

Please remember to upvote the answer that helped you most! If you need any follow-up, just reply on this thread and I'll be happy to circle back.