โ05-08-2025 08:47 AM
Hi Community Experts,
I hope this message finds you well. Our team is currently working on enhancing data quality within our Databricks environment and we are utilizing the Databricks DQX framework for this purpose. We are seeking detailed guidance and support on the following aspects of the DQX framework:
Your expertise and any resources you can provide would be greatly appreciated
โ05-08-2025 09:29 PM
Hi shubham,
How are you doing today?, Itโs great to see your team focusing on data quality using the DQX frameworkโitโs a solid tool for keeping your data clean and reliable. To get started, Iโd suggest beginning with simple checks like NOT NULL, IN RANGE, or UNIQUE validations, and apply them early in your pipeline (ideally right after raw ingestion). You can find setup guidance and rule examples in the official DQX GitHub repo, though itโs a bit light on detailed docs, so testing in a dev environment first really helps. When applying rules, make sure your table and column names match exactly, and double-check data types to avoid silent issues. If you run into problems, the cluster logs usually give cluesโespecially for schema mismatches. Also, consider structuring your checks in a reusable format (like YAML or functions) to make it easier to scale across pipelines. Let me know if you want help with a simple template or sample use caseโhappy to help!
Regards,
Brahma
โ05-09-2025 10:18 PM
Hi Brahma,
Github link is not enough to give details on DQX. Any more material or resources you can recommend will be useful as we prepare. By the way, we have tested NOT NULL, IN RANGE and UNIQUE checks, it works. We need more help on SQL EXPRESSION, REGEX, DATE RANGE and IN FUTURE checks.
โ05-10-2025 04:57 AM
Hello shubham,
Great to hear youโve already got the basic checks working! You're rightโthe GitHub link doesnโt provide much detail for more advanced rules like SQL expressions or regex. For these, Iโd suggest using the SQL_EXPRESSION type in your rule config, which lets you write custom Spark SQL logic. For example, for a regex check on emails, you can use something like "sql_expression": "email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\\\.[a-zA-Z]{2,}$'". For date checks, you can compare against the current date like "sql_expression": "event_date <= current_date()" or "event_date > current_date()" for future dates. These expressions can be tested directly in a notebook before adding them to your DQX config. If you'd like, I can help you put together a few sample rulesโjust let me know the exact columns and checks you're trying to build. Happy to help!
Happy Weekend!
Regards,
Brahma
โ05-12-2025 10:10 PM
Sure Brahma !! As I am working on Healthcare Dataset, Important Columns names are as below (Expected Checks mentioned in round bracket) :
1) Date of Admission ( is_valid_date, is_in_range, is_older_than_n_days,is_valid_date, is_in_range, is_older_than_n_days, is_not_in_near_future )
2) Discharge Date ( is_valid_date, is_in_range, is_older_than_n_days, is_not_in_near_future )
3) Billing Amount ( sql_expression )
4) Medical Condition ( sql_expression / regex_match )
These are important and crucial data quality rules on which I am working. I am trying input these rules using DQX classes and YAML format but output is not getting as expected. Your help is appreciated !!
โ05-16-2025 03:42 AM
@Brahmareddy Hi brahma, awaiting for your response.
โ05-16-2025 05:03 AM
Hi shubham_007,
How are you doing today? Sorry for the delay. Thanks for sharing those column detailsโtheyโre perfect for building meaningful data quality checks! For fields like Date of Admission and Discharge Date, you can use SQL expressions in DQX to check if the date is valid, within a realistic range, older than a certain number of days, or not in the near future. For example, you can write a rule like "sql_expression": "datediff(current_date(), DateOfAdmission) > 7" to check if the date is more than 7 days old. For Billing Amount, a good check could be something like "sql_expression": "BillingAmount >= 0 AND BillingAmount < 100000" to make sure the value isnโt negative or unusually high. And for Medical Condition, you can use a regex to allow only letters and spaces, like "sql_expression": "MedicalCondition RLIKE '^[A-Za-z ]+$'". Make sure to test these expressions in a notebook before adding them to your YAML file, and watch for proper indentation and quotationโYAML can be fussy! Let me know if youโd like me to help write a sample YAML file using these rulesโIโd be glad to help.
Regards,
Brahma
โ05-16-2025 05:18 AM
Hi @Brahmareddy , Thanks for the details. Can you please assist how these quality checks develop using DQX Classes ?