john_odwyer
Databricks Employee
Databricks Employee

The issue is probably related to the self join between 100 million rows, I'm not positive without seeing the code and understanding the problem better but you may want to think about using windowing functions instead

https://blog.knoldus.com/using-windows-in-spark-to-avoid-joins/

View solution in original post