12-29-2024 02:20 AM - edited 12-29-2024 02:21 AM
I have a complex join that I'm trying to optimize. df1 has columns id, main_key, col1, col1_is_null, col2, col2_is_null, ..., col30; df2 has columns id, main_key, col1, col2, ..., col30.
I'm trying to run this SQL query with PySpark:
select df1.id, df2.id
from df1
join df2
  on df1.main_key = df2.main_key
  AND (df1.col1_is_null OR (df1.col1 = df2.col1))
  AND (df1.col2_is_null OR (df1.col2 = df2.col2))
  ...
This query takes a very long time, with just a few long-running straggler tasks. Both dataframes are huge, and the join key is skewed.
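For context, this is roughly how I confirmed the key skew (a minimal sketch; it only assumes df1 and main_key from the schema above):
# Count rows per join key to see how concentrated the distribution is
key_counts = (
    df1.groupBy("main_key")
       .count()
       .orderBy("count", ascending=False)
)
key_counts.show(20)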
Things I've tried:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
Salting the smaller df, exploding the other
Broadcasting the smaller df (sometimes AQE overrides it with a SortMergeJoin(skew=true))
Filtering just the top 2 most common main_key values first, then doing all of the above
Splitting the query into a join on main_key only, followed by a second query that applies the column filters
Task execution is still very skewed. What more can I do to optimize this further?
12-29-2024 02:55 PM
To optimize your complex join in PySpark, you can try the following additional strategies:
Skew Join Optimization Using Skew Hints:
SELECT /*+ SKEW('df1', 'main_key') */ df1.id, df2.id
FROM df1
JOIN df2 ON df1.main_key = df2.main_key
AND (df1.col1_is_null OR (df1.col1 = df2.col1))
AND (df1.col2_is_null OR (df1.col2 = df2.col2))
...
Adaptive Query Execution (AQE):
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
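Note that the values above are close to the Spark 3.x defaults (factor 5, threshold 256MB), so to make AQE split skewed partitions more aggressively you would typically lower them. A minimal sketch; the exact values are assumptions to tune against your data:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Lower the detection thresholds so more partitions qualify for skew splitting
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "2")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "64MB")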
Cost-Based Optimizer (CBO):
ANALYZE TABLE df1 COMPUTE STATISTICS FOR ALL COLUMNS;
ANALYZE TABLE df2 COMPUTE STATISTICS FOR ALL COLUMNS;
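For these statistics to be used, the cost-based optimizer itself must be enabled, and df1/df2 need to be registered tables (e.g., Delta/Hive) rather than plain dataframes. A minimal sketch:
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")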
Dynamic Partition Pruning (DPP):
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
12-29-2024 02:56 PM
Continuation of my comments:
Shuffle Hash Join:
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
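You can also request a shuffle hash join for just this one join via a hint (Spark 3.0+), rather than changing the session-wide setting. A sketch using the column names from your query, hinting the side assumed to be smaller (df2):
result = (
    df1.join(
        df2.hint("shuffle_hash"),
        (df1.main_key == df2.main_key)
        & (df1.col1_is_null | (df1.col1 == df2.col1))
        & (df1.col2_is_null | (df1.col2 == df2.col2)),
        # ...repeat for the remaining colN / colN_is_null pairs
        "inner",
    )
    .select(df1.id, df2.id)
)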
Data Skew Remediation:
skewed_keys = [key1, key2] # Replace with actual skewed keys
df1_skewed = df1.filter(df1.main_key.isin(skewed_keys))
df1_non_skewed = df1.filter(~df1.main_key.isin(skewed_keys))
df2_skewed = df2.filter(df2.main_key.isin(skewed_keys))
df2_non_skewed = df2.filter(~df2.main_key.isin(skewed_keys))
result_skewed = df1_skewed.join(df2_skewed, "main_key")
result_non_skewed = df1_non_skewed.join(df2_non_skewed, "main_key")
result = result_skewed.union(result_non_skewed)
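Continuing the snippet above: if the skewed slice of df2 is small enough to fit in executor memory, broadcasting it keeps the heavy main_key values out of the shuffle entirely. A sketch, assuming df2_skewed is small:
from pyspark.sql.functions import broadcast

# Broadcast the small skewed slice of df2; only the non-skewed part is shuffled
result_skewed = df1_skewed.join(broadcast(df2_skewed), "main_key")
result = result_skewed.union(result_non_skewed)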
Salting:
from pyspark.sql.functions import expr
num_salts = 10
# Salt one side randomly; replicate the other across all salt values
# (salting both sides randomly would drop most matching rows)
df1_salted = df1.withColumn("salt", expr(f"floor(rand() * {num_salts})"))
df2_salted = df2.withColumn("salt", expr(f"explode(sequence(0, {num_salts - 1}))"))
result = df1_salted.join(
    df2_salted,
    (df1_salted.main_key == df2_salted.main_key) & (df1_salted.salt == df2_salted.salt)
)
Please refer to:
https://docs.databricks.com/ja/archive/legacy/skew-join.html
https://www.databricks.com/discover/pages/optimize-data-workloads-guide
12-31-2024 02:47 AM
@Omri thanks for your question!
To help optimize your complex join further, we need clarification on a few details:
Data Characteristics:
The approximate size of df1 and df2 (in rows and/or size).
The distribution of main_key in both dataframes: are the top N keys dominating the distribution, and how many keys are significantly skewed?
Cluster Configuration:
Data Source:
What is the source of df1 and df2? (e.g., Delta tables, Parquet, or external sources like Redshift/S3?)
Query Execution Insights:
Previous Attempts:
If you could additionally capture some Python stack traces (tracebacks), we can get visibility into the slow-running operation and potentially correlate it with the Spark UI metrics for the slow SQL ID and stage.
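To make the slow stage easier to spot, you can also tag the job before triggering the join so it is labeled in the Spark UI. A minimal sketch:
# Label the job so it is easy to find in the Spark UI Jobs/SQL tabs
spark.sparkContext.setJobDescription("df1/df2 skewed join - debug")
result = df1.join(df2, "main_key")
result.count()  # trigger execution; look for this description in the Spark UI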