01-08-2026 06:34 AM - edited 01-08-2026 06:34 AM
There are two fixes that I can think of:
Option A: Make first_value deterministic
first_value(Customer_ID, true) OVER (
PARTITION BY customer_name
ORDER BY submitted ASC, event_id ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
Order by a full timestamp for submitted, not date(), so rows from the same day still sort by time
Add a stable tiebreaker (event_id, record id, etc.)
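The second argument true makes first_value ignore NULLs
The explicit ROWS frame replaces Spark's default frame when ORDER BY is present (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), which treats tied rows as peers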
Option B: Use row_number() instead
If you only need the “first” row deterministically:
row_number() OVER (
PARTITION BY customer_name
ORDER BY submitted_ts ASC, event_id ASC
) AS rn
Then filter on rn = 1 and select or propagate the value from that row.
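Here is a rough end-to-end sketch of Option B. It runs the query through Python's built-in sqlite3 module purely as a stand-in for Spark, so it can be tried anywhere; the window clause itself is standard SQL. The table name `events`, the `first_customer_id` alias and the sample rows are all made up for illustration. The `WHERE Customer_ID IS NOT NULL` filter is my assumption for matching the NULL-skipping behaviour of Option A; drop it if NULLs should count.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark table (assumption:
# hypothetical table "events" with the columns used in the post).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_id INTEGER, customer_name TEXT, "
    "Customer_ID TEXT, submitted_ts TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "Acme", None, "2026-01-01 09:00:00"),     # earliest row, NULL id
        (3, "Acme", "C-200", "2026-01-01 10:00:00"),  # same timestamp as event 2
        (2, "Acme", "C-100", "2026-01-01 10:00:00"),
        (4, "Globex", "C-300", "2026-01-02 08:00:00"),
    ],
)

query = """
WITH ranked AS (
    SELECT customer_name,
           Customer_ID,
           row_number() OVER (
               PARTITION BY customer_name
               ORDER BY submitted_ts ASC, event_id ASC  -- event_id breaks ties
           ) AS rn
    FROM events
    WHERE Customer_ID IS NOT NULL   -- assumption: skip NULL ids, like Option A
),
firsts AS (
    SELECT customer_name, Customer_ID AS first_customer_id
    FROM ranked
    WHERE rn = 1
)
-- Propagate the chosen value back onto every row of the same customer.
SELECT e.event_id, e.customer_name, f.first_customer_id
FROM events e
LEFT JOIN firsts f ON f.customer_name = e.customer_name
ORDER BY e.event_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Events 2 and 3 share a timestamp, so without the event_id tiebreaker either could come out as rn = 1; with it, event 2 is picked on every run.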