Databricks Community

maffeenAF · ‎08-05-2021

I'm trying to use LSH approxSimilarityJoin on a dataset of ~25k 300-d float vectors. The job gets stuck and eventually fails with a 'Slave lost' error. Cluster size and memory are likely not the problem: the failure happens even with 16 nodes, each with 16 cores and 64G RAM (driver of the same size). What would you suggest to make it work?

Using Spark 2.4.5 on GCP DataProc

Dan_Z · ‎08-06-2021

Use a Pandas UDF with Arrow enabled. Pandas UDFs are improved in Spark 3, but you can also use them in Spark 2.4.5.
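For anyone landing here later, below is a minimal sketch of what that suggestion could look like. It is one possible reading of the answer, not something confirmed in the thread. With only ~25k vectors, the full set (about 60 MB as float64) can be broadcast and each Arrow batch compared against it exactly with NumPy, instead of going through LSH. The names `neighbors_within` and `make_neighbors_udf` are hypothetical, and so is the choice of Euclidean distance with a threshold (the question does not say which LSH family was used). The sketch also assumes the vectors are stored as an array<double> column rather than an ML Vector column, since Arrow does not handle the Vector type directly.

```python
import numpy as np
import pandas as pd


def neighbors_within(batch: pd.Series, reference: np.ndarray, threshold: float) -> pd.Series:
    """For each vector in `batch`, list the row indices of `reference`
    whose Euclidean distance to it is at most `threshold`.
    (Hypothetical helper, exact brute force rather than LSH.)"""
    x = np.stack(batch.values).astype(np.float64)  # shape (batch_size, dim)
    # Squared distances via ||x||^2 - 2 x.y + ||y||^2, clipped at 0 for rounding.
    sq = (
        (x ** 2).sum(axis=1)[:, None]
        - 2.0 * (x @ reference.T)
        + (reference ** 2).sum(axis=1)[None, :]
    )
    within = np.maximum(sq, 0.0) <= threshold ** 2
    return pd.Series([np.flatnonzero(row).tolist() for row in within], index=batch.index)


def make_neighbors_udf(spark, reference: np.ndarray, threshold: float):
    """Wrap the helper as a scalar Pandas UDF (Spark 2.3+ style API).
    `spark` is an existing SparkSession; `reference` is the full (n, dim) matrix."""
    # Imported lazily so the helper above can be tried without a Spark install.
    from pyspark.sql.functions import pandas_udf, PandasUDFType
    from pyspark.sql.types import ArrayType, IntegerType

    ref_bc = spark.sparkContext.broadcast(reference)  # shipped once per executor

    @pandas_udf(ArrayType(IntegerType()), PandasUDFType.SCALAR)
    def neighbors_udf(vectors):
        return neighbors_within(vectors, ref_bc.value, threshold)

    return neighbors_udf
```

Assuming a DataFrame `df` with an array<double> column named `features_array` (both names are placeholders), the UDF would be applied with something like `df.withColumn("neighbors", make_neighbors_udf(spark, reference, 2.0)("features_array"))`. Each batch builds a batch_size x 25k distance matrix, so with the default of 10,000 rows per Arrow batch it may be worth lowering `spark.sql.execution.arrow.maxRecordsPerBatch` to keep per-task memory modest.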


How do I make approxSimilarityJoin work on 25k 300-d vectors?
