01-19-2023 10:52 AM
This issue was caused by Spark's parallelization, which doesn't guarantee that the same data is assigned to the same partition on every run.
I was able to resolve this by making sure the same data is always assigned to the same partitions:
# Both calls return a new DataFrame, so keep the result
df = df.repartition(num_partitions, "ur_col_id")
df = df.sortWithinPartitions("ur_col_id")
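
To see why this helps, here is a rough plain-Python model of the idea, not Spark itself. The helper `repartition_and_sort` is hypothetical, and `zlib.crc32` is only a stand-in for Spark's internal hash function. It sketches how hashing on a key column and then sorting inside each partition gives the same layout no matter what order the rows arrive in:

```python
import random
import zlib

def repartition_and_sort(rows, num_partitions, key):
    """Toy model of repartition(num_partitions, key) + sortWithinPartitions(key)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # A stable hash of the key picks the partition
        # (assumption: stand-in for Spark's own hash partitioner).
        idx = zlib.crc32(str(row[key]).encode()) % num_partitions
        partitions[idx].append(row)
    # Sort each partition by the key; ties are broken by the full row
    # so that the resulting order is fully determined.
    return [sorted(p, key=lambda r: (r[key], sorted(r.items()))) for p in partitions]

rows = [{"ur_col_id": i % 5, "value": i} for i in range(20)]

# Simulate a different arrival order, as can happen between parallel runs.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

layout_a = repartition_and_sort(rows, 4, "ur_col_id")
layout_b = repartition_and_sort(shuffled, 4, "ur_col_id")
print(layout_a == layout_b)
```

Note that the model breaks ties using the whole row. In Spark, sorting on "ur_col_id" alone may still leave rows that share the same id in a varying order, so if that matters, adding more columns to sortWithinPartitions is probably worth trying.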