Understanding dropDuplicates in Delta Live Tables (DLT) with Photon

lucami — Tue, 10 Jun 2025 09:14:56 GMT

Hi everyone,

I've been working with Delta Live Tables (DLT) in Databricks, and I'm particularly interested in understanding how the dropDuplicates function works when using the Photon engine. Photon is known for its columnar data processing capabilities, which significantly enhance performance. However, I've noticed something intriguing about how dropDuplicates handles data. It appears that the function might use the FIRST operation to determine which values to keep when removing duplicates.

This raises an important question: Could FIRST potentially select values from different rows for different columns?

To illustrate, consider the following example:

ID	COL1	COL2
1	A	X
2	A	Y
1	B	Z

If we apply dropDuplicates on ID, the result might be:

ID	COL1	COL2
1	A	X
2	A	Y

Here, ID=1 ends up with COL1 = A and COL2 = X, which happen to come from the same source row. My concern is that nothing seems to guarantee this: if FIRST(COL1) and FIRST(COL2) are evaluated independently, they could pick values from different rows and produce something like 1 A Z, which does not exist as a record in the source.

So, my question for the community is: When using dropDuplicates in a DLT pipeline with Photon, how exactly are the values for the columns selected? Is it possible that FIRST could take values from different rows?

Looking forward to your insights and experiences!

Re: Understanding dropDuplicates in Delta Live Tables (DLT) with Photon

cgrant — Fri, 20 Jun 2025 21:52:26 GMT

FIRST() never stitches together values from different rows.
When Photon executes dropDuplicates, it deterministically chooses one complete row for each set of duplicate keys and returns every column from that same row. If you ever encounter a result where columns appear to come from different rows, please open a support ticket—that would indicate a bug.

Under the hood, FIRST() just returns the first row encountered in each (deterministically-ordered) batch, so all column values originate from that single row.
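
To make the distinction concrete, here is a minimal plain-Python sketch of the semantics described above. It is only a model, not Spark or Photon code: the helper names drop_duplicates_by_key and rows_missing_from_source are made up for illustration, and "first row seen" here simply means list order. It uses the three rows from the question and checks that every deduplicated row exists verbatim in the source, while the hypothetical stitched row 1 A Z does not.

```python
# Plain-Python model of whole-row deduplication.
# Assumption: this only illustrates the semantics; it is NOT how Spark/Photon is implemented.
from typing import Dict, Hashable, List, Tuple

Row = Tuple[int, str, str]  # (ID, COL1, COL2)


def drop_duplicates_by_key(rows: List[Row], key_index: int = 0) -> List[Row]:
    """Keep the first complete row seen for each key value (hypothetical helper)."""
    kept: Dict[Hashable, Row] = {}
    for row in rows:
        # setdefault stores the whole row only the first time a key appears,
        # so all columns of a kept row come from one source row.
        kept.setdefault(row[key_index], row)
    return list(kept.values())


def rows_missing_from_source(source: List[Row], result: List[Row]) -> List[Row]:
    """Return result rows that do not appear verbatim in the source (hypothetical helper)."""
    source_set = set(source)
    return [row for row in result if row not in source_set]


# The example table from the question.
source_rows: List[Row] = [(1, "A", "X"), (2, "A", "Y"), (1, "B", "Z")]

deduped = drop_duplicates_by_key(source_rows)
stitched = [(1, "A", "Z")]  # the mixed row the question worries about

print(deduped)                                          # [(1, 'A', 'X'), (2, 'A', 'Y')]
print(rows_missing_from_source(source_rows, deduped))   # []
print(rows_missing_from_source(source_rows, stitched))  # [(1, 'A', 'Z')]
```

On a real pipeline, a comparable sanity check could be a left anti join of the deduplicated DataFrame against the source on all columns; under the behavior described above it should return no rows.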
