
How do I make approxSimilarityJoin work on 25k 300-d vectors?

maffeenAF
New Contributor

I'm trying to use LSH approxSimilarityJoin on a dataset of ~25k 300-dimensional float vectors. The job gets stuck and eventually fails with a 'Slave lost' error. Cluster size and memory are probably not the problem: the failure happens even with 16 nodes, each with 16 cores and 64 GB of RAM (and a driver of the same size). What would you suggest to make it work?

Using Spark 2.4.5 on GCP Dataproc.

1 REPLY

Dan_Z
Honored Contributor

Use a pandas UDF with Arrow enabled. Pandas UDFs are improved in Spark 3, but they are also available in Spark 2.4.5.
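One possible reading of this suggestion, as a rough sketch rather than a definitive recipe: at ~25k vectors the whole dataset is only about 60 MB, so an exact brute-force Euclidean join in NumPy (not LSH) may be enough when run inside a grouped-map pandas UDF. The column names `id` and `features`, the helper names, and the block size below are assumptions made for illustration. The sketch also assumes `features` is an `array<double>` column, since Spark ML vector columns are, as far as I know, not handled by the Arrow conversion in 2.4.

```python
import numpy as np


def pairs_within_threshold(ids, vectors, threshold, block_size=1024):
    """Return (id_a, id_b, distance) for each pair with Euclidean distance <= threshold.

    Exact brute force (not LSH), computed block by block so that only a
    block_size x n distance matrix is held in memory at any time.
    """
    ids = np.asarray(ids)
    x = np.asarray(vectors, dtype=np.float64)
    sq_norms = np.einsum("ij,ij->i", x, x)  # squared norm of every row
    out = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
        d2 = (sq_norms[start:start + block_size, None]
              + sq_norms[None, :]
              - 2.0 * (block @ x.T))
        np.maximum(d2, 0.0, out=d2)  # clamp tiny negatives caused by rounding
        rows, cols = np.nonzero(d2 <= threshold ** 2)
        keep = (rows + start) < cols  # report each unordered pair once, skip self-pairs
        rows, cols = rows[keep], cols[keep]
        dists = np.sqrt(d2[rows, cols])
        for r, c, d in zip(rows + start, cols, dists):
            out.append((ids[r].item(), ids[c].item(), float(d)))
    return out


def make_grouped_map_udf(threshold):
    """Wrap the function above as a Spark 2.4-style grouped-map pandas UDF.

    Requires pyspark, pandas and pyarrow on the cluster; imported lazily so the
    NumPy part can be tried without Spark. Assumes hypothetical columns
    `id` (long) and `features` (array<double>).
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("idA long, idB long, distance double", PandasUDFType.GROUPED_MAP)
    def join_group(pdf):
        vectors = np.vstack(pdf["features"].to_numpy())
        rows = pairs_within_threshold(pdf["id"].to_numpy(), vectors, threshold)
        result = pd.DataFrame(rows, columns=["idA", "idB", "distance"])
        # Explicit dtypes so an empty result still matches the declared schema.
        return result.astype({"idA": "int64", "idB": "int64", "distance": "float64"})

    return join_group
```

With a DataFrame `df` shaped as assumed above, this could be applied as `df.withColumn("g", lit(0)).groupBy("g").apply(make_grouped_map_udf(2.5))`, which sends all rows to a single group. That only makes sense because the data is small. For much larger inputs the rows would need to be partitioned into several groups, and pairs that span groups would have to be handled separately.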
