09-27-2023 02:35 AM
Can we use the Deequ library with Azure Databricks? If yes, please provide some support material or examples.
Is there any similar data quality library, or any suggestion for achieving automatic data quality checks during data engineering (Azure Databricks)?
Thanks in advance,
Anantha
09-27-2023 10:46 PM
@ankris I've never used Deequ, so I can't say anything about that library, but I've used Great Expectations and it works really well.
09-28-2023 10:43 PM
Thank you for the prompt reply. I will go through it. In the meantime, could anyone provide information on using Deequ with Azure Databricks? Also, can any other library/package (equivalent to Deequ) be used in Azure Databricks?
09-29-2023 02:03 AM
@ankris can you describe your data pipeline a bit? If you are writing in Delta Live Tables (my recommendation) then you can express data quality checks "in flight" as the pipeline processes data. You can do post-ETL data quality checks (e.g. at the aggregate level) with Lakehouse Monitoring.
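Since Delta Live Tables expectations only run inside a Databricks pipeline, here is a small Spark-free sketch of the "in flight" idea, loosely modeled on DLT's @dlt.expect_or_drop. The decorator and names below are illustrative, not the real dlt API:

```python
# Illustrative sketch of an "in flight" expectation: rows that fail the
# predicate are dropped as the table function produces them.
def expect_or_drop(name, predicate):
    """Decorator: drop rows failing the predicate and report how many."""
    def wrap(table_fn):
        def inner():
            rows = table_fn()
            kept = [r for r in rows if predicate(r)]
            print(f"{name}: dropped {len(rows) - len(kept)} of {len(rows)} rows")
            return kept
        return inner
    return wrap

@expect_or_drop("valid_id", lambda r: r["id"] is not None)
def clean_orders():
    # Stand-in for reading an upstream table.
    return [{"id": 1}, {"id": None}, {"id": 2}]

print(clean_orders())  # [{'id': 1}, {'id': 2}]
```

In real DLT the predicate is a SQL expression string and the dropped-row counts surface in the pipeline's event log, but the flow is the same: the check runs as part of producing the table, not as a separate post-hoc job.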
10-16-2023 03:33 AM
We have a data pipeline built with Delta tables and orchestrated using ADF. On top of it we would like to build a data quality assessor using an external web interface (already built with a Streamlit app). We also plan to pass some variables from the external interface to evaluate data quality.
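One way to wire externally supplied variables into quality checks is to treat each check as data (a name plus a predicate) and evaluate the rule set against the rows. A minimal, Spark-free sketch of that idea, with all names hypothetical:

```python
# Hypothetical sketch: evaluating parameterized data-quality rules
# passed in from an external interface (e.g. a Streamlit app).
def evaluate_rules(rows, rules):
    """Each rule is a (name, predicate) pair; returns {name: pass_fraction}."""
    results = {}
    for name, predicate in rules:
        passed = sum(1 for r in rows if predicate(r))
        results[name] = passed / len(rows) if rows else 1.0
    return results

rows = [{"id": 1, "amount": 10.0}, {"id": None, "amount": -5.0}]
# Thresholds or column names could come in as external variables.
rules = [
    ("id_not_null", lambda r: r["id"] is not None),
    ("amount_positive", lambda r: r["amount"] > 0),
]
print(evaluate_rules(rows, rules))  # {'id_not_null': 0.5, 'amount_positive': 0.5}
```

On Databricks the same shape works against Spark DataFrames (predicates become filter expressions), and the returned pass fractions are easy to render back in the Streamlit dashboard.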
01-16-2024 05:09 AM
In Databricks, you can install external libraries by going to the Clusters tab, selecting your cluster, and then adding the Maven coordinates for Deequ.
Alternatively, you can add Deequ as a dependency when the Spark session is created, using the spark.jars.packages configuration option. Note that this is a static setting: it must be supplied before the session starts (e.g. in the cluster's Spark config or via spark-submit), so calling spark.conf.set("spark.jars.packages", ...) on an already-running session will not load the jar.
spark.jars.packages com.amazon.deequ:deequ:1.4.0
Write your data quality checks using Deequ functions. For example:
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.checks.{Check, CheckLevel}

val verificationResult: VerificationResult = VerificationSuite()
  .onData(yourDataFrame)
  .addCheck(
    Check(CheckLevel.Error, "data quality checks")
      .isComplete("yourColumn") // define your data quality constraints here
      .isUnique("yourColumn"))
  .run()
07-25-2024 08:36 AM
Hi there!
You could also take a look at Rudol; it provides no-code data quality validations, so non-technical roles such as Business Analysts or Data Stewards can configure quality checks by themselves.