01-24-2025 10:32 AM
Hi @Klusener
Distinct is a very expensive operation. For your case, I recommend using either of the deduplication strategies below.
Most efficient method:
df_deduped = df.dropDuplicates(subset=['unique_key_columns'])
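For a more complex dedupe process, where you need to control which duplicate is kept (for example, the most recent record), partition by the key columns, rank the rows, and filter on the rank: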
WITH ranked_data AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY unique_key_columns ORDER BY timestamp DESC) as rnk
FROM table_1
)
SELECT * FROM ranked_data WHERE rnk = 1
Alternatively, you can increase the executor memory or use a memory-optimized cluster when configuring the job compute.
Let me know if you need anything else; otherwise, please mark this as the solution. Cheers!