Cosine similarity between all rows pairwise on a dataset of 100 million rows

Databricks2005
New Contributor III

Hello everyone,

I am facing a performance issue while calculating cosine similarity in PySpark on a DataFrame with around 100 million records.

I am doing a cross self join on the DataFrame to calculate the similarity between every pair of rows.
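
For scale, here is a minimal plain-Python sketch (not the actual PySpark job; the function names and toy vectors are hypothetical) of what an all-pairs cosine similarity computes, and of how many pairs a cross self join on 100 million rows produces:

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_cosine(rows):
    """All-pairs similarity for a dict {row_id: vector}, keeping each unordered pair once."""
    return {
        (i, j): cosine_similarity(rows[i], rows[j])
        for i, j in combinations(sorted(rows), 2)
    }


def pair_counts(n):
    """Rows produced by a full cross self join vs. unique unordered pairs (i < j)."""
    return n * n, n * (n - 1) // 2


# Hypothetical toy data, not taken from the real dataset.
toy_rows = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [1.0, 1.0]}
toy_sims = pairwise_cosine(toy_rows)

full_pairs, unique_pairs = pair_counts(100_000_000)
print(f"cross join rows: {full_pairs:.3e}, unique pairs: {unique_pairs:.3e}")
```

A full cross self join on 100 million rows yields 10^16 row pairs, or roughly 5 × 10^15 if each unordered pair is kept only once, so the work grows quadratically even when tasks are spread evenly across executors.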

In the Spark UI, all executors show the same number of tasks.

The input size to all executors is also almost the same.

Executors: 20

Cores: 4

Any input would be highly appreciated.