Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Result Difference Between View and Manually Run View Query

ChrisRose
Visitor

I am experiencing an issue where a view does not display the correct results, but running the view's query manually in either a new notebook or the SQL Editor returns different, correct results. I have tried switching the notebook's compute to match the view creation job's SQL Warehouse, and I have also restarted the SQL Warehouse, but the issue persists. I have also dropped the view and recreated it with the same code, and the results are still incorrect.


6 REPLIES

ckunal_eng
New Contributor

@ChrisRose  Can you please elaborate on what you mean when you say the view does not show correct results? Are you performing any computation or aggregation? Can you share the view query, if possible?

K_Anudeep
Databricks Employee

Hello @ChrisRose ,

The only reason I can think of for this is caching, but in that case restarting the warehouse should have solved the issue. Could you provide steps to reproduce this issue so that I can check internally?

Anudeep

The issue is with a calculated field using the first_value aggregate function invoked as a window function with the OVER clause:

first_value(Customer_ID) OVER (PARTITION BY customer_name ORDER BY date(submitted) ASC) AS Customer_ID

The field calculates correctly in a notebook or the SQL Editor, but when the view is run in a job via a notebook, the value comes back NULL for a particular record. It does not happen for all records. Please let me know if that is enough information, or if you need more to go on.

What you're seeing is very consistent with first_value being non-deterministic when used as a window function, unless your window ordering is fully deterministic and you handle NULLs explicitly. Databricks' own function docs call out that first_value is non-deterministic: https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/first_value

In your case there are two common "gotchas" that explain "works in the SQL Editor / notebook, but sometimes NULL in a job":

1) Ties introduced by date(submitted)

You're ordering by date(submitted) (day precision). If a customer has multiple rows on the same day, there are ties in the ORDER BY. With ties, the "first row" within the day is not guaranteed unless you add a tiebreaker: different execution plans (interactive vs job) can pick a different physical row, and if the earliest physical row has Customer_ID = NULL, you'll get NULL.

Fix: order by a true timestamp + a unique tiebreaker (event id / ingestion id / primary key).

2) NULL handling

By default first_value(expr) respects NULLs, so if the first row (per the ordering) has Customer_ID NULL, the result is NULL. Databricks supports an ignoreNull boolean argument for first_value (or, equivalently, the IGNORE NULLS clause described in the docs).
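Both gotchas can be reproduced outside Databricks. Below is a minimal sketch using Python's built-in sqlite3 module (SQLite 3.25+ supports window functions; it lacks IGNORE NULLS, so ordering NULLs last emulates it here). The table and column names mirror the thread but are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE submissions (
  record_id INTEGER PRIMARY KEY,   -- unique tiebreaker
  customer_name TEXT,
  submitted TEXT,                  -- full timestamp, not just the day
  Customer_ID TEXT
);
-- Two rows on the same day; the earlier one has a NULL Customer_ID,
-- so day-precision ordering ties them and first_value may legally
-- return either NULL or 'C-42' depending on the physical plan.
INSERT INTO submissions VALUES
  (1, 'Acme', '2024-05-01 08:00:00', NULL),
  (2, 'Acme', '2024-05-01 09:30:00', 'C-42');
""")

# Deterministic version: order non-NULL IDs first (emulating IGNORE NULLS),
# then by the full timestamp, then by the unique record_id tiebreaker.
fixed = con.execute("""
SELECT DISTINCT first_value(Customer_ID) OVER (
  PARTITION BY customer_name
  ORDER BY (Customer_ID IS NULL), submitted, record_id
) FROM submissions
""").fetchone()[0]
print(fixed)  # C-42 for every row in the partition
```

On Databricks you would instead keep `first_value(Customer_ID, true)` (or `IGNORE NULLS`) and simply make the ORDER BY unambiguous; the `(Customer_ID IS NULL)` sort key is only the SQLite stand-in for that.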

ckunal_eng
New Contributor

@ChrisRose  to add to what has been said, non-deterministic ordering appears to be the reason. You can check for ties using this query:

SELECT
  customer_name,
  DATE(submitted),
  COUNT(*)
FROM your_view
GROUP BY customer_name, DATE(submitted)
HAVING COUNT(*) > 1

If a row on the earliest submitted date for that customer_name has Customer_ID = NULL, the SQL Editor or notebook run may happen to pick a non-NULL row among the tied rows and appear correct; however, if the job run picks the NULL row first, the result is also NULL.

You can confirm using this query:

SELECT
  customer_name,
  submitted,
  Customer_ID
FROM your_view
WHERE customer_name = '<problem_customer>'
ORDER BY DATE(submitted) ASC

Additionally, to resolve this you can add a tiebreaker to the query, as given below:

FIRST_VALUE(Customer_ID) IGNORE NULLS
OVER (
  PARTITION BY customer_name
  ORDER BY submitted ASC, record_id ASC
) AS Customer_ID

 

 

bianca_unifeye
Contributor

There are two fixes I can think of.

Option A: Make first_value deterministic
first_value(Customer_ID, true) OVER (
  PARTITION BY customer_name
  ORDER BY submitted ASC, event_id ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

 

  • Use the full timestamp submitted, not date(submitted)

  • Add a stable tiebreaker (event_id, record id, etc.)

  • true → ignore NULLs (the second argument to first_value)

  • An explicit ROWS frame avoids Spark's default RANGE behaviour with ties
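The RANGE-vs-ROWS point in the last bullet is easy to see with a running count over tied ORDER BY keys. A sketch using Python's sqlite3, which shares SQL's default frame of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (table name and data are made up for the demo):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (day TEXT, v INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("d1", 10), ("d1", 20)])

# Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
# Rows tied on the ORDER BY key are peers, so both rows see a count of 2.
range_counts = [r[0] for r in con.execute(
    "SELECT count(*) OVER (ORDER BY day) FROM t")]

# An explicit ROWS frame counts physical rows, so the two counts differ.
rows_counts = [r[0] for r in con.execute(
    """SELECT count(*) OVER (
         ORDER BY day
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) FROM t""")]
print(range_counts, sorted(rows_counts))  # [2, 2] [1, 2]
```

first_value with an UNBOUNDED PRECEDING frame start is less frame-sensitive than aggregates like this, but the peer-row behaviour is the same mechanism that makes tied orderings ambiguous.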

Option B: Use row_number() instead

If you only need the "first" row deterministically:

row_number() OVER (
  PARTITION BY customer_name
  ORDER BY submitted ASC, event_id ASC
)

Then select or propagate the value from the row where rn = 1.
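Selecting from rn = 1 can look like the following sketch, again runnable with Python's sqlite3 and illustrative table/column names (on Databricks you could also filter with QUALIFY rn = 1 instead of a subquery). NULL IDs are pushed last in the ordering, consistent with the NULL-handling fix earlier in the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE submissions (record_id INTEGER PRIMARY KEY,
                          customer_name TEXT, submitted TEXT, Customer_ID TEXT);
INSERT INTO submissions VALUES
  (1, 'Acme',   '2024-05-01 08:00:00', NULL),
  (2, 'Acme',   '2024-05-01 09:30:00', 'C-42'),
  (3, 'Globex', '2024-05-02 10:00:00', 'C-7');
""")

# Rank rows per customer (non-NULL IDs first, then time, then unique id),
# and keep only the top-ranked row for each customer.
first_ids = con.execute("""
SELECT customer_name, Customer_ID FROM (
  SELECT customer_name, Customer_ID,
         row_number() OVER (
           PARTITION BY customer_name
           ORDER BY (Customer_ID IS NULL), submitted, record_id
         ) AS rn
  FROM submissions
)
WHERE rn = 1
ORDER BY customer_name
""").fetchall()
print(first_ids)  # [('Acme', 'C-42'), ('Globex', 'C-7')]
```

To propagate the picked value back onto every row (as the original view does), join this rn = 1 result back to the table on customer_name.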