data quality check in data engineering

ankris
New Contributor III

Can we use the Deequ library with Azure Databricks? If yes, please share some supporting material or examples.

Is there any similar data quality library, or a suggestion for achieving automated data quality checks during data engineering on Azure Databricks?

Thanks in advance,

Anantha

6 REPLIES

daniel_sahal
Esteemed Contributor

@ankris I've never used Deequ, so I can't say anything about that library, but I've used Great Expectations and it works really well.

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/9681...
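For a sense of what a Great Expectations check looks like on a Spark DataFrame, here is a minimal sketch using the legacy SparkDFDataset API (assuming great-expectations is installed on the cluster; df and order_id are illustrative placeholders, not from this thread):

from great_expectations.dataset import SparkDFDataset

# Wrap an existing Spark DataFrame and run a single expectation
gdf = SparkDFDataset(df)
result = gdf.expect_column_values_to_not_be_null("order_id")
print(result.success)  # True if the column contains no nulls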

ankris
New Contributor III

Thank you for the prompt reply; I will go through it. In the meantime, could anyone provide information on Deequ usage with Azure Databricks? Also, is there any other library/package (equivalent to Deequ) that can be used in Azure Databricks?

BilalAslamDbrx
Databricks Employee

@ankris can you describe your data pipeline a bit? If you are writing it in Delta Live Tables (my recommendation), you can express data quality checks "in flight" as the pipeline processes data. You can then do post-ETL data quality checks (e.g. at the aggregate level) with Lakehouse Monitoring.
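For concreteness, a minimal sketch of an "in flight" expectation in a DLT Python pipeline; the table and column names (raw_orders, order_id, amount) are illustrative placeholders, not from this thread:

import dlt

# Rows failing expect_or_drop are dropped; a plain expect only records metrics.
@dlt.table(comment="Orders that passed basic quality checks")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect("positive_amount", "amount > 0")
def clean_orders():
    return dlt.read("raw_orders")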

We have a data pipeline built with Delta tables and orchestrated using ADF. On top of it, we would like to build a data quality assessor using an external web interface (already built as a Streamlit app). We also plan to pass some variables from the external interface to evaluate data quality.

CharlesReily
New Contributor III

In Databricks, you can install external libraries by going to the Clusters tab, selecting your cluster, and adding the Maven coordinates for Deequ under Libraries.

In your notebook or script, the Deequ jar must be on the Spark session's classpath. Note that spark.jars.packages is only read when a session is created, so setting it at runtime with spark.conf.set has no effect; on Databricks, installing the Maven coordinates on the cluster (as above) is the reliable route. Outside Databricks, you would pass it when building the session:

val spark = SparkSession.builder()
  .config("spark.jars.packages", "com.amazon.deequ:deequ:1.4.0").getOrCreate()

Write your data quality checks using Deequ functions. For example:

import com.amazon.deequ.{VerificationResult, VerificationSuite}
import com.amazon.deequ.checks.{Check, CheckLevel}

val verificationResult: VerificationResult = VerificationSuite()
  .onData(yourDataFrame)
  .addCheck(
    Check(CheckLevel.Error, "data quality checks") // severity plus a description
      .isComplete("yourColumn")                    // define your constraints here
      .isUnique("yourColumn")
  )
  .run()
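If you work in Python on Databricks, the same API is exposed by the pydeequ wrapper. A minimal sketch, assuming pydeequ is installed on the cluster and that df and yourColumn are placeholders:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite

check = Check(spark, CheckLevel.Error, "data quality checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check.isComplete("yourColumn").isUnique("yourColumn"))
          .run())

# One row per constraint, with its status and any failure message
VerificationResult.checkResultsAsDataFrame(spark, result).show()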

joarobles
New Contributor III

Hi there! 

You could also take a look at Rudol: it provides no-code data quality validations, so non-technical roles such as Business Analysts or Data Stewards can configure quality checks by themselves.
