bianca_unifeye
Databricks MVP

There are two fixes I can think of:

Option A: Make first_value deterministic

first_value(Customer_ID, true) OVER (
  PARTITION BY customer_name
  ORDER BY submitted ASC, event_id ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

 

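  • Order by the full timestamp column (submitted), not date(...): truncating to a date makes every row from the same day tie

  • Add a stable tiebreaker (event_id, a record id, etc.) so rows with identical timestamps always sort the same way

  • The second argument, true, tells first_value to ignore NULLs

  • The explicit ROWS frame spans the whole partition, so every row sees the same first non-NULL value; Spark’s default frame with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which stops at the current row and its peers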

Option B: Use row_number() instead

If you only need the “first” row deterministically:

row_number() OVER (
  PARTITION BY customer_name
  ORDER BY submitted ASC, event_id ASC
) AS rn

Then filter on rn = 1, or join that row back to the rest of the partition to propagate its value.
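
In case it helps, here is a minimal sketch of how the rn = 1 row could be propagated back to every row of the partition. It runs on SQLite through Python's sqlite3 module only so that it is self-contained; the window query uses standard SQL that should carry over to Spark SQL, but I have not run this exact statement on Databricks. The table name events, the sample rows, and the IS NOT NULL filter (standing in for the ignore-NULLs flag from Option A) are assumptions for illustration, and the column names follow Option A.

```python
import sqlite3

# Hypothetical sample data: events 2 and 3 share the same timestamp,
# so only the event_id tiebreaker makes their order unambiguous.
rows = [
    # (event_id, customer_name, customer_id, submitted)
    (1, "acme", None,  "2024-01-01 09:00:00"),
    (2, "acme", "C-7", "2024-01-01 10:00:00"),
    (3, "acme", "C-9", "2024-01-01 10:00:00"),
    (4, "zeta", "Z-1", "2024-01-02 08:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_id INTEGER, customer_name TEXT, customer_id TEXT, submitted TEXT)"
)
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)

query = """
WITH ranked AS (
  SELECT customer_name,
         customer_id,
         row_number() OVER (
           PARTITION BY customer_name
           ORDER BY submitted ASC, event_id ASC
         ) AS rn
  FROM events
  WHERE customer_id IS NOT NULL   -- assumption: skip NULL ids, as in Option A
)
SELECT e.event_id,
       e.customer_name,
       r.customer_id AS first_customer_id
FROM events AS e
LEFT JOIN ranked AS r
  ON r.customer_name = e.customer_name
 AND r.rn = 1                     -- keep only the first ranked row per customer
ORDER BY e.event_id
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With this sample data every acme row ends up with C-7, because event_id 2 wins the tie against event_id 3. Note that rows with a NULL customer_name would not match the join, so they would need separate handling.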