What you’re seeing is very consistent with first_value being non-deterministic when used as a window function unless your window ordering is fully deterministic and you handle NULLs explicitly. Databricks’ own function docs call out that first_value is non-deterministichttps://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/first_value

In your case there are two common “gotchas” that explain “works in SQL editor / notebook, but sometimes NULL in a job”:

1) Ties introduced by date(submitted)

You’re ordering by date(submitted) (day precision). If a customer has multiple rows on the same day, then there are ties in the ORDER BY. With ties, the “first row” within the day is not guaranteed unless you add a tiebreaker — different execution plans (interactive vs job) can pick a different physical row, and if the earliest physical row has Customer_ID = NULL, you’ll get NULL.

Fix: order by a true timestamp + a unique tiebreaker (event id / ingestion id / primary key).

2) NULL handling

By default first_value(expr) respects nulls, so if the first row (per ordering) has Customer_ID NULL, the result is NULL. Databricks supports an ignoreNull boolean argument (or IGNORE NULLS semantics in the docs) for first_value

View solution in original post