Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Assistance needed on DQX framework: the GitHub resources we are referring to do not provide enough detail

shubham_007
Contributor III

Hi Community Experts,

I hope this message finds you well. Our team is currently working on enhancing data quality within our Databricks environment and we are utilizing the Databricks DQX framework for this purpose. We are seeking detailed guidance and support on the following aspects of the DQX framework:

  • Best practices for implementing data quality checks.
  • Detailed documentation on setting up and configuring the framework.
  • Examples of common data quality rules and how to apply them.
  • Troubleshooting tips for common issues encountered during implementation.

Your expertise and any resources you can provide would be greatly appreciated.

7 REPLIES

Brahmareddy
Honored Contributor III

Hi shubham,

How are you doing today? It’s great to see your team focusing on data quality using the DQX framework; it’s a solid tool for keeping your data clean and reliable. To get started, I’d suggest beginning with simple checks like NOT NULL, IN RANGE, or UNIQUE validations, and applying them early in your pipeline (ideally right after raw ingestion). You can find setup guidance and rule examples in the official DQX GitHub repo, though it’s a bit light on detailed docs, so testing in a dev environment first really helps. When applying rules, make sure your table and column names match exactly, and double-check data types to avoid silent issues. If you run into problems, the cluster logs usually give clues, especially for schema mismatches. Also, consider structuring your checks in a reusable format (like YAML or functions) to make it easier to scale across pipelines. Let me know if you want help with a simple template or sample use case; happy to help!
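To make that concrete, here is a minimal sketch of a metadata-driven check definition applied from a notebook. The layout (criticality, check, function, arguments) and names such as is_not_null, is_in_range, col_name, and DQEngine.apply_checks_by_metadata_and_split follow the examples in the DQX GitHub README, but argument names have shifted between releases, so verify them against the version you have installed. The patient_id and age columns and the table name are placeholders.

import yaml
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

# Starter checks (NOT NULL and IN RANGE style), kept as reusable metadata.
checks = yaml.safe_load("""
- criticality: error
  check:
    function: is_not_null
    arguments:
      col_name: patient_id
- criticality: warn
  check:
    function: is_in_range
    arguments:
      col_name: age
      min_limit: 0
      max_limit: 120
""")

# Hypothetical raw table right after ingestion; `spark` is the notebook session.
input_df = spark.read.table("main.raw.patients")

dq_engine = DQEngine(WorkspaceClient())

# Split rows that pass every check from rows to quarantine for review.
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

The idea is that valid_df continues down the pipeline while quarantine_df is written somewhere for review.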

Regards,

Brahma

Hi Brahma,

The GitHub link alone is not enough to give details on DQX; any additional material or resources you can recommend would be useful as we prepare. By the way, we have tested the NOT NULL, IN RANGE, and UNIQUE checks and they work. We need more help with the SQL EXPRESSION, REGEX, DATE RANGE, and IN FUTURE checks.

Brahmareddy
Honored Contributor III

Hello shubham,

Great to hear you’ve already got the basic checks working! You're right—the GitHub link doesn’t provide much detail for more advanced rules like SQL expressions or regex. For these, I’d suggest using the SQL_EXPRESSION type in your rule config, which lets you write custom Spark SQL logic. For example, for a regex check on emails, you can use something like "sql_expression": "email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\\\.[a-zA-Z]{2,}$'". For date checks, you can compare against the current date like "sql_expression": "event_date <= current_date()" or "event_date > current_date()" for future dates. These expressions can be tested directly in a notebook before adding them to your DQX config. If you'd like, I can help you put together a few sample rules—just let me know the exact columns and checks you're trying to build. Happy to help!
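If it helps, here is a quick way to sanity-check those expressions on a tiny sample frame in a notebook before wiring them into the DQX config. The email and event_date column names are placeholders, and spark is the session a Databricks notebook provides.

from pyspark.sql import functions as F

# Tiny sample with one row that should pass and one that should fail each rule.
sample_df = spark.createDataFrame(
    [("a@example.com", "2024-01-15"), ("not-an-email", "2999-12-31")],
    ["email", "event_date"],
).withColumn("event_date", F.to_date("event_date"))

# Evaluate the same SQL expressions DQX would run, so you can eyeball pass/fail per row.
sample_df.select(
    "email",
    "event_date",
    F.expr(r"email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'").alias("email_ok"),
    F.expr("event_date <= current_date()").alias("not_in_future"),
).show(truncate=False)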

Happy Weekend!

Regards,

Brahma

Sure, Brahma! I am working on a healthcare dataset; the important column names are listed below (expected checks mentioned in parentheses):

1) Date of Admission ( is_valid_date, is_in_range, is_older_than_n_days, is_not_in_near_future )

2) Discharge Date ( is_valid_date, is_in_range, is_older_than_n_days, is_not_in_near_future )

3) Billing Amount ( sql_expression )

4) Medical Condition ( sql_expression / regex_match )

These are the crucial data quality rules I am working on. I am trying to define these rules using DQX classes and the YAML format, but the output is not as expected. Your help is appreciated!

@Brahmareddy Hi Brahma, awaiting your response.

Brahmareddy
Honored Contributor III

Hi shubham_007,

How are you doing today? Sorry for the delay. Thanks for sharing those column details—they’re perfect for building meaningful data quality checks! For fields like Date of Admission and Discharge Date, you can use SQL expressions in DQX to check if the date is valid, within a realistic range, older than a certain number of days, or not in the near future. For example, you can write a rule like "sql_expression": "datediff(current_date(), DateOfAdmission) > 7" to check if the date is more than 7 days old. For Billing Amount, a good check could be something like "sql_expression": "BillingAmount >= 0 AND BillingAmount < 100000" to make sure the value isn’t negative or unusually high. And for Medical Condition, you can use a regex to allow only letters and spaces, like "sql_expression": "MedicalCondition RLIKE '^[A-Za-z ]+$'". Make sure to test these expressions in a notebook before adding them to your YAML file, and watch for proper indentation and quotation—YAML can be fussy! Let me know if you’d like me to help write a sample YAML file using these rules—I’d be glad to help.
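If it's useful, here is a rough YAML sketch pulling those rules together for your healthcare columns, loaded from a notebook. The column names are written without spaces (DateOfAdmission, DischargeDate, BillingAmount, MedicalCondition); if your actual columns contain spaces, wrap them in backticks inside the expressions. The criticality/check/function/arguments layout and the expression and msg argument names follow the DQX README examples, so double-check them against your installed version, including whether sql_expression expects the passing condition or the failing condition.

import yaml

# Hedged sketch of rules for the healthcare columns discussed above. Each expression
# is written as the condition a good row should satisfy, matching the examples in
# this thread; confirm the semantics against the DQX docs for your installed version.
healthcare_checks = yaml.safe_load("""
- criticality: error
  check:
    function: sql_expression
    arguments:
      expression: "DateOfAdmission IS NOT NULL AND DateOfAdmission <= current_date()"
      msg: "admission date missing or in the future"
- criticality: warn
  check:
    function: sql_expression
    arguments:
      expression: "datediff(current_date(), DateOfAdmission) > 7"
      msg: "admission date is not older than 7 days"
- criticality: error
  check:
    function: sql_expression
    arguments:
      expression: "DischargeDate IS NULL OR DischargeDate <= current_date()"
      msg: "discharge date is in the future"
- criticality: error
  check:
    function: sql_expression
    arguments:
      expression: "BillingAmount >= 0 AND BillingAmount < 100000"
      msg: "billing amount negative or unusually high"
- criticality: error
  check:
    function: sql_expression
    arguments:
      expression: "MedicalCondition RLIKE '^[A-Za-z ]+$'"
      msg: "medical condition contains unexpected characters"
""")

You could then pass healthcare_checks to the same apply_checks_by_metadata_and_split call shown earlier in the thread.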

Regards,

Brahma

Hi @Brahmareddy, thanks for the details. Can you please assist with how to develop these quality checks using DQX classes?
