cancel
Showing results for 
Search instead for 
Did you mean: 
Data Governance
Join discussions on data governance practices, compliance, and security within the Databricks Community. Exchange strategies and insights to ensure data integrity and regulatory compliance.
cancel
Showing results for 
Search instead for 
Did you mean: 

Issue in DQX DQGeneration with AI - DQrules generated with sql_query function having syntax errors.

Dharmin
New Contributor

While using DQX to generate DQ rules on a table i have used the following prompt.

from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.config import InputConfig
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
generator = DQGenerator(workspace_client=ws, spark=spark)
# Step 1: Profile the data to generate summary statistics
profiler = DQProfiler(workspace_client=ws, spark=spark)
# Generate rules with input data schema awareness
input_df = spark.read.table(f"{schema}.{tablename}")
summary_stats, profiles = profiler.profile(input_df)
user_input = """
Make sure tid and tdescr are one to one mapped
make sure l and ldesc are one to one mapped
make sure effective startdate and effective enddate are lesser than current date
Make sure the test cases are executable through dbx and should not throw any errors like table or view not found
"""
# Option A: With business context
checks = generator.generate_dq_rules_ai_assisted(
    user_input=user_input,
    summary_stats=summary_stats
)

print(checks)
 
For the above code i am getting some valid response but the response is not as directly usable.
 

{'criticality': 'error', 'check': {'function': 'sql_query', 'arguments': {'query': 'SELECT TID, COUNT(DISTINCT TIDDESCR) as desc_count FROM input_view GROUP BY TID HAVING COUNT(DISTINCT TIDDESCR) > 1', 'name': 'tid_to_description_one_to_one', 'msg': 'CTID must map to exactly one TIDDESCR'}}}
 
I tried to correct it as below.
 
- check:
    arguments:
      merge_columns:
      - TID
      - TIDDESCR
      msg: TID has multiple TIDDESCR values
      input_placeholder: input_view
      condition_column: desc_count
      name: tid_tiddescr_one_to_one
      query: with main_cte as (SELECT TID, TIDDESCR, COUNT(DISTINCT TIDDESCR) AS desc_count
        FROM {{ input_view }} WHERE TID IS NOT NULL GROUP BY TID, TIDDESCR
        HAVING COUNT(DISTINCT TIDDESCR) > 1)select *,case when desc_count > 1 then true else false end as condition from main_cte
    function: sql_query
  criticality: error
- check:
arguments:
merge_columns:
- l
- lDESC
input_placeholder: input_view
msg: l has multiple lDESC values
name: l_ldesc_one_to_one
condition_column: condition
query: with main_cte as (SElECT l, lDESC, COUNT(DISTINCT lDESC) AS desc_count
FROM {{ input_view }} WHERE l IS NOT NUll GROUP BY l, lDESC
HAVING COUNT(DISTINCT lDESC) > 1)select *,case when desc_count > 1 then true else false end as condition from main_cte
function: sql_query
criticality: error
Still  facing the following issue
{"ts": "2026-03-13 06:46:43.193", "level": "ERROR", "logger": "DataFrameQueryContextLogger", "msg": "[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve \"CASE WHEN Tid_Tiddescr_one_to_one_desc_count_de5f40594692464892443d2e0412aaef THEN TID has multiple TIDDESCR values ELSE CAST(NULL AS STRING) END\" due to data type mismatch: The first parameter requires the \"BOOLEAN\" type, however \"Tid_Tiddescr_one_to_one_desc_count_de5f40594692464892443d2e0412aaef\" has the type \"BIGINT\". SQLSTATE: 42K09;\n'Project [TID#1233, TIDDESCR#1234, l#1235, lDESC#1236,
 
Can some one guide me on what chould be the issue.
I need help on the following things
1.Is there any possibility to get better response from the LLM model. Any better prompt etc.?
2.Why is the sql_query function not working?
0 REPLIES 0