Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Understanding dropDuplicates in Delta Live Tables (DLT) with Photon

mai_luca
New Contributor III

Hi everyone,

I've been working with Delta Live Tables (DLT) in Databricks, and I'm particularly interested in understanding how the dropDuplicates function works when using the Photon engine. Photon is known for its columnar data processing capabilities, which significantly enhance performance. However, I've noticed something intriguing about how dropDuplicates handles data. It appears that the function might use the FIRST operation to determine which values to keep when removing duplicates.

[Image: plan.png]

This raises an important question: Could FIRST potentially select values from different rows for different columns?

To illustrate, consider the following example:

ID  COL1  COL2
1   A     X
2   A     Y
1   B     Z

If we apply dropDuplicates on ID, the result might be (?)

ID  COL1  COL2
1   A     X
2   A     Y

In this case, ID=1 ends up with COL1 = A and COL2 = X, but there's no guarantee that COL2 comes from the same row as COL1. My concern is that FIRST(COL1) and FIRST(COL2) might select values from different rows, producing something like (1, A, Z), which does not exist as a record.

So, my question for the community is: When using dropDuplicates in a DLT pipeline with Photon, how exactly are the values for the columns selected? Is it possible that FIRST could take values from different rows?

Looking forward to your insights and experiences!

Accepted Solution

cgrant
Databricks Employee

FIRST() never stitches together values from different rows.
When Photon executes dropDuplicates, it deterministically chooses one complete row for each set of duplicate keys and returns every column from that same row. If you ever encounter a result where columns appear to come from different rows, please open a support ticket, because that would indicate a bug.

Under the hood, FIRST() just returns the first row encountered in each (deterministically-ordered) batch, so all column values originate from that single row.
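
To make the "one complete row per key" behavior concrete, here is a minimal plain-Python sketch using the example table from the question. It is only an emulation of the semantics described above, not Photon's or Spark's actual implementation, and the helper names `drop_duplicates_first_row` and `stitched_rows` are hypothetical, introduced just for illustration. The second helper shows one way you might sanity-check a deduplicated result: every output row should appear verbatim in the input.

```python
def drop_duplicates_first_row(rows, key):
    """Keep the first complete row seen for each key value (illustrative emulation)."""
    kept = {}
    for row in rows:
        # setdefault stores the whole row only the first time a key value is seen,
        # so all columns in the result come from that single row.
        kept.setdefault(row[key], row)
    return list(kept.values())


def stitched_rows(deduped, original):
    """Return rows from `deduped` that do not appear verbatim in `original`."""
    seen = {tuple(sorted(r.items())) for r in original}
    return [r for r in deduped if tuple(sorted(r.items())) not in seen]


# Example data from the question.
rows = [
    {"ID": 1, "COL1": "A", "COL2": "X"},
    {"ID": 2, "COL1": "A", "COL2": "Y"},
    {"ID": 1, "COL1": "B", "COL2": "Z"},
]

deduped = drop_duplicates_first_row(rows, "ID")
print(deduped)

# An empty list means no output row mixes values from different input rows.
print(stitched_rows(deduped, rows))

# The hypothetical "stitched" record (1, A, Z) from the question would be flagged.
print(stitched_rows([{"ID": 1, "COL1": "A", "COL2": "Z"}], rows))
```

On a real pipeline, the same kind of check could be expressed as an anti-join of the deduplicated table against the source on all columns; a non-empty result would be the situation worth a support ticket.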


