Re: Performance Issue with MinHash + Approx Simila...

Louis_Frolio · 09-18-2025

Hey @dsoat, out of curiosity:

1. Expectations

You mentioned it’s “taking longer than expected.” What were your expectations, and what are you comparing them against?

2. Data

• What is the total size of the dataset after initial filtering (in GB or rows)?

• Are there any unusual data skews or highly frequent tokens/values in the matching columns?

• What is the estimated cardinality of the blocking keys or generated tokens?

• How many columns are involved in the LSH and join operations, and what are their data types?

• How many unique records participate in the self-join phase?

• Are there known outliers or unusually large input files that could create unbalanced partitions?
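If it helps with the skew and cardinality questions, here is a rough sketch of how I would summarize per-key row counts. The helper name `skew_summary`, the DataFrame `df`, and the column name `token` are placeholders of mine, not anything from your job, and the max-to-median ratio is just one simple skew signal.

```python
from statistics import median


def skew_summary(key_counts, top_n=5):
    """Summarize how unevenly rows are spread across keys.

    key_counts: iterable of (key, row_count) pairs, one pair per distinct key.
    """
    # Sort keys from most to least frequent.
    pairs = sorted(key_counts, key=lambda kc: kc[1], reverse=True)
    if not pairs:
        raise ValueError("no key counts supplied")
    counts = [count for _, count in pairs]
    med = median(counts)
    return {
        "distinct_keys": len(counts),       # cardinality of the key
        "max_count": counts[0],             # rows under the heaviest key
        "median_count": med,
        "max_to_median": counts[0] / med,   # large values hint at skew
        "top_keys": pairs[:top_n],
    }


# Hypothetical usage in a notebook (df and "token" are placeholder names):
#   rows = df.groupBy("token").count().collect()
#   print(skew_summary((r["token"], r["count"]) for r in rows))
```

Collecting one row per key to the driver is only reasonable when the number of distinct keys is modest; for very high cardinality I would sort and limit in Spark first and only pull back the heaviest keys.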

3. Cluster Configuration

• How many nodes are in the cluster, and what are their specs (memory, vCPUs per node)?

• How many executors, executor cores, and how much executor memory is allocated?

• What value is set for spark.sql.shuffle.partitions, and has it been tuned for your workload?

• Is dynamic resource allocation enabled (spark.dynamicAllocation.enabled), or is allocation static?

• Have you observed executor/driver OOM errors, excessive GC, or shuffle spill/writes in the Spark UI?

• Is Adaptive Query Execution (AQE) enabled for this cluster or session (spark.sql.adaptive.enabled)?
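To make the configuration questions quicker to answer, something like the sketch below can dump the relevant settings from a notebook. It assumes a live `spark` session object; the helper name `collect_settings` and the "<not set>" placeholder are my own, and on some cluster access modes certain keys may not be readable this way.

```python
# Settings asked about above, plus the standard executor sizing keys.
SETTINGS = [
    "spark.sql.shuffle.partitions",
    "spark.dynamicAllocation.enabled",
    "spark.sql.adaptive.enabled",
    "spark.executor.memory",
    "spark.executor.cores",
]


def collect_settings(spark, keys=SETTINGS):
    """Return {key: value} for each setting, using a placeholder if unset."""
    # spark.conf.get(key, default) returns the default when the key is not set.
    return {key: spark.conf.get(key, "<not set>") for key in keys}


# Hypothetical usage in a notebook where `spark` already exists:
#   for key, value in collect_settings(spark).items():
#       print(f"{key} = {value}")
```

Pasting that output into the thread, along with the node type and worker count from the cluster page, would cover most of this section.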

4. Environment

• What Spark and Databricks Runtime versions are you using?

• Are you working with Delta Lake tables, and if so, are they optimized/compacted?

• Is this workload running as a scheduled job or as ad-hoc analysis?
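For the Delta question, one rough way to judge compaction is the average data file size, which can be derived from the `numFiles` and `sizeInBytes` columns that `DESCRIBE DETAIL` returns for a Delta table. The helper `avg_file_size_mb` and the table name below are placeholders of mine, and this is only a quick sanity check, not a full diagnosis.

```python
def avg_file_size_mb(num_files, size_in_bytes):
    """Average data file size in MiB; returns 0.0 for an empty table."""
    if num_files <= 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)


# Hypothetical usage in a notebook ("my_db.my_table" is a placeholder name):
#   print(spark.version)  # Spark version of the attached cluster
#   detail = spark.sql("DESCRIBE DETAIL my_db.my_table").first()
#   print(avg_file_size_mb(detail["numFiles"], detail["sizeInBytes"]))
```

A very small average spread over a large number of files usually means the table has not been compacted recently, in which case running OPTIMIZE before the join is worth a try.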

Let me know.

Cheers, Louis.