Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-24-2025 08:10 PM
Thanks for the response. Could you please elaborate why is the distinct is an expensive operation? From my understanding, it's similar to a group by operation, where Spark likely uses hashing as a key to shuffle the data and eliminate duplicates. Why adding just distinct causing job failure/huge overrun? if you look at the query ie step-2, has group by on few columns (not all fields part of distinct ) to compute sum, max metrics which works fine.