I'm new to PySpark, but I've run into an odd issue with joins: the action seems to take exponentially longer each time I add a new join to a function I'm writing.
I'm trying to join a dataset of ~3 million records to one of ~17 million records ten times (each time with slightly different join criteria). Each join on its own takes 15-50 seconds to run, but when I chain the joins together in one function, the runtime grows exponentially (e.g. join 2 runs in a minute, but by join 5 the function takes about 11 minutes, and by join 7/8 the notebook runs for hours and then fails with a generic cluster error).
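The structure is roughly like this (simplified sketch with placeholder DataFrame and column names, not my real schema or join criteria):

```python
from pyspark.sql import functions as F

def add_joins(small_df, large_df):
    """Chain left joins of the ~17M-row table onto the ~3M-row base.
    Column names ("id", "as_of_date", etc.) are placeholders; each real
    join uses slightly different criteria."""
    out = small_df

    # Join 1: simple equi-join on a shared key
    out = out.join(
        large_df.select("id", F.col("value").alias("value_1")),
        on="id",
        how="left",
    )

    # Join 2: same key plus an extra condition
    right = large_df.select(
        F.col("id").alias("id_2"),
        F.col("event_date"),
        F.col("value").alias("value_2"),
    )
    out = out.join(
        right,
        on=(out["id"] == right["id_2"]) & (right["event_date"] <= out["as_of_date"]),
        how="left",
    ).drop("id_2", "event_date")

    # ...and so on, up to roughly ten joins with varying conditions
    return out
```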
I've tried repartitioning and caching the data before the joins, but if anything this seems to slow them down even further.
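For reference, the mitigation I attempted looked roughly like this (again with a placeholder key name and partition count):

```python
# Repartition both sides on the join key, cache, and force materialisation
small_df = small_df.repartition(200, "id").cache()
large_df = large_df.repartition(200, "id").cache()
small_df.count()  # action to populate the cache before the joins
large_df.count()
```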
I can't work out what I've done wrong, and having QA'd every line of the notebook, nothing obvious jumps out.