yesterday
I am experiencing an issue where a view does not display the correct results, but running the view query manually in either a new notebook or the SQL Editor displays different, correct results. I have tried switching the compute resource in the notebook to match the view creation job's SQL Warehouse, and I have also restarted the SQL Warehouse, but the issue persists. I have also dropped the view and created a new view with the same code, and the results are still incorrect.
yesterday
What you're seeing is very consistent with first_value being non-deterministic when used as a window function, unless your window ordering is fully deterministic and you handle NULLs explicitly. Databricks' own function docs call out that first_value is non-deterministic: https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/first_value
In your case there are two common "gotchas" that explain "works in SQL editor / notebook, but sometimes NULL in a job":
You're ordering by date(submitted) (day precision). If a customer has multiple rows on the same day, then there are ties in the ORDER BY. With ties, the "first row" within the day is not guaranteed unless you add a tiebreaker: different execution plans (interactive vs job) can pick a different physical row, and if the earliest physical row has Customer_ID = NULL, you'll get NULL.
Fix: order by a true timestamp + a unique tiebreaker (event id / ingestion id / primary key).
By default first_value(expr) respects NULLs, so if the first row (per the ordering) has Customer_ID NULL, the result is NULL. Databricks supports an ignoreNull boolean argument (or the equivalent IGNORE NULLS syntax) for first_value.
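Putting both points together, a sketch of what the fixed expression could look like (submitted_ts and event_id are placeholders for whatever full-precision timestamp and unique-key columns your table actually has):

```sql
first_value(Customer_ID) IGNORE NULLS OVER (
  PARTITION BY customer_name
  -- full-precision timestamp first, then a unique tiebreaker for determinism
  ORDER BY submitted_ts ASC, event_id ASC
) AS first_customer_id
```

On Databricks, first_value(Customer_ID, true) is equivalent shorthand for the IGNORE NULLS clause.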
yesterday
@ChrisRose Can you please elaborate on the part where you say the view does not show correct results? Are you performing any computation or aggregation? Can you share the view query if possible?
yesterday
Hello @ChrisRose ,
The only reason I can think of why this could happen is caching, but in that case restarting the warehouse should have solved the issue. Could you provide steps to reproduce this issue so that I can check internally?
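If stale cache is suspected, these standard Databricks SQL commands can be used to rule it out (the table name here is a placeholder, not from the thread):

```sql
-- Invalidate cached metadata/data for the underlying table
REFRESH TABLE main.sales.orders;

-- Drop all cached tables from the in-memory cache
CLEAR CACHE;
```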
yesterday
The issue is with a calculated field using the first_value aggregate function invoked as a window function with the OVER clause -
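The expression itself isn't quoted in full in the thread; based on the details in the replies, it presumably has roughly this shape (column names inferred, not confirmed):

```sql
first_value(Customer_ID) OVER (
  PARTITION BY customer_name
  ORDER BY DATE(submitted) ASC   -- day precision: ties within a day are possible
) AS Customer_ID
```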
yesterday
@ChrisRose to add to what has already been said, non-deterministic ordering seems to be the reason. You can check for ties using this query:
SELECT
customer_name,
DATE(submitted),
COUNT(*)
FROM your_view
GROUP BY customer_name, DATE(submitted)
HAVING COUNT(*) > 1
If the earliest submitted date for that customer_name has customer_id = NULL, the SQL editor/notebook may appear correct; however, since job runs see the full data, if the first row in the window is NULL the result is also NULL.
You can confirm using this query:
SELECT
customer_name,
submitted,
Customer_ID
FROM your_view
WHERE customer_name = '<problem_customer>'
ORDER BY DATE(submitted) ASC
Additionally, to resolve this you can add a tiebreaker to the query as given below:
FIRST_VALUE(Customer_ID) IGNORE NULLS
OVER (
PARTITION BY customer_name
ORDER BY submitted ASC, record_id ASC
) AS Customer_ID
yesterday - last edited yesterday
There are 2 fixes that I can think of:
first_value(Customer_ID, true) OVER (
PARTITION BY customer_name
ORDER BY submitted ASC, event_id ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
Use the raw timestamp submitted, not date(submitted)
Add a stable tiebreaker (event_id, record id, etc.)
The true argument ignores NULLs
The explicit ROWS frame avoids Spark's default RANGE behaviour with ties
If you only need the "first" row deterministically:
row_number() OVER (
PARTITION BY customer_name
ORDER BY submitted_ts ASC, event_id ASC
) AS rn
Then select or propagate the value from rn = 1.
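On Databricks you can also filter on the window function directly with the QUALIFY clause instead of wrapping the row_number() in a subquery (column names follow the sketch above and are assumptions):

```sql
SELECT customer_name, submitted_ts, Customer_ID
FROM your_view
QUALIFY row_number() OVER (
  PARTITION BY customer_name
  ORDER BY submitted_ts ASC, event_id ASC
) = 1
```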