data quality check in data engineering

ankris
New Contributor III

Can we use the Deequ library with Azure Databricks? If yes, please provide some supporting material or examples.

Is there any similar data quality library, or a suggestion for achieving automated data quality checks during data engineering on Azure Databricks?

Thanks in advance,

Anantha

 

5 REPLIES

daniel_sahal
Esteemed Contributor

@ankris I've never used Deequ, so I can't say anything about that library, but I've used Great Expectations and it works really well.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/9681...
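To give a flavor of what libraries like Great Expectations (or Deequ) automate, here is a minimal pure-Python sketch — no Spark and no external packages; the rows, column names, and rule names are invented for illustration. The idea is the same at any scale: declare constraints, evaluate them over the dataset, collect pass/fail results.

```python
# Illustrative only: the kind of declarative completeness/uniqueness
# checks that Deequ or Great Expectations evaluate over a dataset.

def check_complete(rows, column):
    """Pass if no row has a NULL/missing value in `column`."""
    return all(row.get(column) is not None for row in rows)

def check_unique(rows, column):
    """Pass if every value in `column` is distinct."""
    values = [row.get(column) for row in rows]
    return len(values) == len(set(values))

# Invented sample data: one row has a missing email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": None},
]

results = {
    "id is complete": check_complete(rows, "id"),
    "id is unique": check_unique(rows, "id"),
    "email is complete": check_complete(rows, "email"),
}
```

A real library adds the hard parts on top of this shape: running the constraints distributed over Spark, profiling metrics, and reporting which rows violated which rule.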

ankris
New Contributor III

Thank you for the prompt reply; I will go through it. In the meantime, could anyone provide information on Deequ usage with Azure Databricks? Also, can any other library/package (equivalent to Deequ) be used in Azure Databricks?

 

BilalAslamDbrx
Honored Contributor II

@ankris can you describe your data pipeline a bit? If you are writing in Delta Live Tables (my recommendation) then you can express data quality checks "in flight" as the pipeline processes data. You can do post-ETL data quality checks (e.g. at the aggregate level) with Lakehouse Monitoring.

We have a data pipeline built with Delta tables and orchestrated using ADF. On top of it, we would like to build a data quality assessor using an external web interface (already built as a Streamlit app). We also plan to pass some variables from the external interface to evaluate data quality.
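For reference, the "in flight" expectations Bilal mentions look like this in a Delta Live Tables pipeline — a sketch only: it runs exclusively inside a DLT pipeline on Databricks, and the table and column names (`raw_orders`, `order_id`, `amount`) are invented for illustration.

```python
# Sketch of Delta Live Tables expectations applied while data is processed.
# Only runs inside a DLT pipeline on Databricks; names are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Orders with data-quality gates applied in flight")
@dlt.expect("non_negative_amount", "amount >= 0")            # log violations, keep rows
@dlt.expect_or_drop("order_id_set", "order_id IS NOT NULL")  # drop violating rows
def clean_orders():
    return dlt.read("raw_orders").where(col("order_date").isNotNull())
```

Since your pipeline is orchestrated by ADF over plain Delta tables rather than DLT, the post-ETL route (Lakehouse Monitoring, or a library-based check step your Streamlit app triggers) may be the more natural fit.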

CharlesReily
New Contributor III

In Databricks, you can install external libraries by going to the Clusters tab, selecting your cluster, and then adding the Maven coordinates for Deequ.

One caveat: the spark.jars.packages option is only read when the Spark session is created, so setting it at runtime with spark.conf.set will not actually load the jar. On Databricks, installing the Maven coordinates as a cluster library (as above) is sufficient — the library is available to the cluster's Spark session automatically. Outside Databricks, pass the package when launching Spark instead, e.g.:

spark-shell --packages com.amazon.deequ:deequ:1.4.0

Write your data quality checks using Deequ's VerificationSuite. A Check is built from a severity level and a description, with constraint methods chained onto it. For example:

import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

val verificationResult: VerificationResult = VerificationSuite()
  .onData(yourDataFrame)
  .addCheck(
    Check(CheckLevel.Error, "data quality checks")
      .isComplete("yourColumn") // no NULL values in the column
      .isUnique("yourColumn")   // no duplicate values in the column
  )
  .run()

// Inspect verificationResult.checkResults to see which constraints failed
if (verificationResult.status != CheckStatus.Success) {
  println("Data quality checks failed")
}
