Hi everyone,
I've been working with Delta Live Tables (DLT) in Databricks, and I'd like to understand how the dropDuplicates function works when running on the Photon engine. Photon is known for its columnar data processing, which significantly improves performance. However, I've noticed that dropDuplicates appears to use the FIRST operation to decide which values to keep when removing duplicates.

This raises an important question: Could FIRST potentially select values from different rows for different columns?
To illustrate, consider the following example:
If we apply dropDuplicates on ID, the result might be (?)
In this case, the value of COL1 is A and COL2 is X for ID=1, but there's no guarantee that COL2 comes from the same row as COL1. My concern is that FIRST(COL1) and FIRST(COL2) might each pick a value from a different row, producing a combination such as (1, A, Z) that does not exist as a record in the source data.
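To make the concern concrete, here is a toy sketch in plain Python. It is not Spark or Photon code and says nothing about how dropDuplicates is actually implemented; the two rows and the function names row_wise_first and column_wise_first are made up for illustration. It only contrasts keeping the first whole row per ID with a hypothetical worst case where each column's FIRST sees the rows in a different order.

```python
# Toy model only: plain Python, not Spark/Photon internals.
# Hypothetical rows (ID, COL1, COL2); both share ID=1.
rows = [
    (1, "A", "X"),
    (1, "B", "Z"),
]

def row_wise_first(rows):
    """Keep the first whole row seen per ID, so all values come from one row."""
    kept = {}
    for row in rows:
        kept.setdefault(row[0], row)
    return list(kept.values())

def column_wise_first(rows, col1_order, col2_order):
    """Assumed worst case: each column's FIRST visits rows in its own order."""
    col1, col2 = {}, {}
    for i in col1_order:
        col1.setdefault(rows[i][0], rows[i][1])
    for i in col2_order:
        col2.setdefault(rows[i][0], rows[i][2])
    return [(key, col1[key], col2[key]) for key in col1]

whole = row_wise_first(rows)
mixed = column_wise_first(rows, col1_order=[0, 1], col2_order=[1, 0])

print(whole)             # [(1, 'A', 'X')]
print(mixed)             # [(1, 'A', 'Z')]
print(mixed[0] in rows)  # False: this combination is not a source record
```

If dropDuplicates behaves like row_wise_first, every output row exists in the input. The second function shows the kind of mixed result I'm worried about if the columns were resolved independently.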
So, my question for the community is: When using dropDuplicates in a DLT pipeline with Photon, how exactly are the values for the columns selected? Is it possible that FIRST could take values from different rows?
Looking forward to your insights and experiences!