While using DQX to generate DQ rules on a table i have used the following prompt.
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.config import InputConfig
from databricks.sdk import WorkspaceClient
ws = WorkspaceClient()
generator = DQGenerator(workspace_client=ws, spark=spark)
# Step 1: Profile the data to generate summary statistics
profiler = DQProfiler(workspace_client=ws, spark=spark)
# Generate rules with input data schema awareness
input_df = spark.read.table(f"{schema}.{tablename}")
summary_stats, profiles = profiler.profile(input_df)
user_input = """
Make sure tid and tdescr are one to one mapped
make sure l and ldesc are one to one mapped
make sure effective startdate and effective enddate are lesser than current date
Make sure the test cases are executable through dbx and should not throw any errors like table or view not found
"""
# Option A: With business context
checks = generator.generate_dq_rules_ai_assisted(
user_input=user_input,
summary_stats=summary_stats
)
print(checks)
For the above code i am getting some valid response but the response is not as directly usable.
{'criticality': 'error', 'check': {'function': 'sql_query', 'arguments': {'query': 'SELECT TID, COUNT(DISTINCT TIDDESCR) as desc_count FROM input_view GROUP BY TID HAVING COUNT(DISTINCT TIDDESCR) > 1', 'name': 'tid_to_description_one_to_one', 'msg': 'CTID must map to exactly one TIDDESCR'}}}
I tried to correct it as below.
- check:
arguments:
merge_columns:
- TID
- TIDDESCR
msg: TID has multiple TIDDESCR values
input_placeholder: input_view
condition_column: desc_count
name: tid_tiddescr_one_to_one
query: with main_cte as (SELECT TID, TIDDESCR, COUNT(DISTINCT TIDDESCR) AS desc_count
FROM {{ input_view }} WHERE TID IS NOT NULL GROUP BY TID, TIDDESCR
HAVING COUNT(DISTINCT TIDDESCR) > 1)select *,case when desc_count > 1 then true else false end as condition from main_cte
function: sql_query
criticality: error
- check:
arguments:
merge_columns:
- l
- lDESC
input_placeholder: input_view
msg: l has multiple lDESC values
name: l_ldesc_one_to_one
condition_column: condition
query: with main_cte as (SElECT l, lDESC, COUNT(DISTINCT lDESC) AS desc_count
FROM {{ input_view }} WHERE l IS NOT NUll GROUP BY l, lDESC
HAVING COUNT(DISTINCT lDESC) > 1)select *,case when desc_count > 1 then true else false end as condition from main_cte
function: sql_query
criticality: error
Still facing the following issue
{"ts": "2026-03-13 06:46:43.193", "level": "ERROR", "logger": "DataFrameQueryContextLogger", "msg": "[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve \"CASE WHEN Tid_Tiddescr_one_to_one_desc_count_de5f40594692464892443d2e0412aaef THEN TID has multiple TIDDESCR values ELSE CAST(NULL AS STRING) END\" due to data type mismatch: The first parameter requires the \"BOOLEAN\" type, however \"Tid_Tiddescr_one_to_one_desc_count_de5f40594692464892443d2e0412aaef\" has the type \"BIGINT\". SQLSTATE: 42K09;\n'Project [TID#1233, TIDDESCR#1234, l#1235, lDESC#1236,
Can some one guide me on what chould be the issue.
I need help on the following things
1.Is there any possibility to get better response from the LLM model. Any better prompt etc.?
2.Why is the sql_query function not working?