09-18-2025 08:17 AM
Hey @dsoat, out of curiosity:
1. Expectations
You mentioned it’s “taking longer than expected.” What were your expectations, and what are you comparing them against?
2. Data
• What is the total size of the dataset after initial filtering (in GB or rows)?
• Are there any unusual data skews or highly frequent tokens/values in the matching columns?
• What is the estimated cardinality of the blocking keys or generated tokens?
• How many columns are involved in the LSH and join operations, and what are their data types?
• How many unique records participate in the self-join phase?
• Are there known outliers or unusually large input files that could create unbalanced partitions?
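If it helps with the skew and cardinality questions above, here is a minimal sketch of the numbers I'm after. It is plain Python over a sample of blocking keys pulled to the driver, not your actual pipeline: `skew_summary` and the sample values are hypothetical, and the pair count assumes the self-join only matches records that share the same key.

```python
from collections import Counter

def skew_summary(keys, top_n=3):
    """Summarise how unevenly records are spread across blocking keys."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return {"records": 0, "distinct_keys": 0, "max_share": 0.0,
                "self_join_pairs": 0, "top_keys": []}
    # Assumption: the self-join pairs up records with an identical key,
    # so a key holding n records yields n * (n - 1) / 2 candidate pairs.
    pairs = sum(n * (n - 1) // 2 for n in counts.values())
    return {
        "records": total,
        "distinct_keys": len(counts),                # key cardinality
        "max_share": max(counts.values()) / total,   # share held by the hottest key
        "self_join_pairs": pairs,
        "top_keys": counts.most_common(top_n),
    }

# Hypothetical sample of blocking keys
sample = ["a", "a", "a", "a", "b", "b", "c"]
summary = skew_summary(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A high `max_share` or a pair count dominated by one or two keys would point at skew rather than raw data volume.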
3. Cluster Configuration
• How many nodes are in the cluster, and what are their specs (memory, vCPUs per node)?
• How many executors, executor cores, and how much executor memory is allocated?
• What value is set for spark.sql.shuffle.partitions, and has it been tuned for your workload?
• Is dynamic resource allocation enabled (spark.dynamicAllocation.enabled), or is allocation static?
• Have you observed executor/driver OOM errors, excessive GC, or shuffle spill/writes in the Spark UI?
• Is Adaptive Query Execution (AQE) enabled for this cluster or session (spark.sql.adaptive.enabled)?
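To make the configuration questions above quick to answer, a small sketch like the following could dump the relevant settings in one go. `collect_settings` and the `SETTINGS` tuple are hypothetical helpers of mine, and the dictionary is only a stand-in so the snippet runs without a cluster; some keys may simply be unset in your environment.

```python
# Settings asked about above; all are standard Spark configuration keys.
SETTINGS = (
    "spark.sql.shuffle.partitions",
    "spark.dynamicAllocation.enabled",
    "spark.sql.adaptive.enabled",
    "spark.executor.memory",
    "spark.executor.cores",
)

def collect_settings(conf_get, keys=SETTINGS, missing="<unset>"):
    """Look up each key through a getter shaped like get(key, default)."""
    return {key: conf_get(key, missing) for key in keys}

# Stand-in for a real session config (values here are made up)
fake_conf = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
}
report = collect_settings(fake_conf.get)
for key, value in report.items():
    print(f"{key} = {value}")
```

In a notebook with an active `spark` session, passing `spark.conf.get` in place of `fake_conf.get` should give the real values, assuming the session exposes those keys.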
4. Environment
• What Spark and Databricks Runtime versions are you using?
• Are you working with Delta Lake tables, and if so, are they optimized/compacted?
• Is this workload running as a scheduled job or as ad-hoc analysis?
Let me know.
Cheers, Louis.