Re: Out of Memory after adding distinct operation

MadhuB · ‎01-24-2025

Distinct is a very expensive operation: it compares entire rows and forces a full shuffle of the data. For your case, I recommend one of the deduplication strategies below.

Most efficient method: drop duplicates on the key columns only (replace the placeholder with your own column names).
df_deduped = df.dropDuplicates(subset=['unique_key_columns'])

For a more complex dedupe process, partition by the key columns, rank the rows within each partition, and keep only the top-ranked row per key.
WITH ranked_data AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY unique_key_columns ORDER BY timestamp DESC) AS rnk
  FROM table_1
)
SELECT * FROM ranked_data WHERE rnk = 1
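
If it helps to see which rows survive the rnk = 1 filter, here is a tiny plain-Python sketch of the same "keep the latest row per key" logic on in-memory rows. This is not Spark code, only an illustration of the semantics; the helper name latest_per_key and the sample columns (id, value, timestamp) are made up for the example.

```python
from datetime import datetime


def latest_per_key(rows, key_fields, ts_field):
    """Keep one row per key: the row with the greatest timestamp.

    Mirrors ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) = 1.
    """
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        # Strict '>' means the first-seen row wins on a timestamp tie.
        if key not in latest or row[ts_field] > latest[key][ts_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data: two versions of id=1, one version of id=2.
rows = [
    {"id": 1, "value": "old", "timestamp": datetime(2025, 1, 1)},
    {"id": 1, "value": "new", "timestamp": datetime(2025, 1, 2)},
    {"id": 2, "value": "only", "timestamp": datetime(2025, 1, 1)},
]

deduped = latest_per_key(rows, ["id"], "timestamp")
```

One thing to watch for in the SQL version: if two rows for the same key share the same timestamp, ROW_NUMBER() picks one of them arbitrarily, so add a tie-breaker column to the ORDER BY if you need a repeatable result.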

Alternatively, you can increase the executor memory or choose a memory-optimized cluster when configuring the job compute.

Let me know if you need anything else; otherwise, please mark this as the solution. Cheers!