Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to know which join type was used (broadcast, shuffle hash, or sort merge join) for a query?

smoortema
Contributor

What is the best way to know which kind of join was used for a SQL query: broadcast, shuffle hash, or sort-merge? How should the Spark UI or the query plan be interpreted?

1 REPLY

Louis_Frolio
Databricks Employee

Hello @smoortema, here are some helpful tips and tricks.

Here's how to quickly determine which join strategy Spark used (broadcast hash join, shuffle hash join, or sort-merge join) and how to read both the query plan and the Spark UI to verify it.

Quick answers

  • The easiest way: run SQL EXPLAIN or DataFrame.explain to see the initial physical plan; look for operator names like BroadcastHashJoin, ShuffledHashJoin, or SortMergeJoin in the plan output (a minimal example follows this list).
  • To see what was actually executed (especially with AQE enabled), use the Spark UI's SQL tab. The diagram shows the current/final plan; join nodes are labeled and include metrics (rows output, shuffle read/write, broadcast size).
  • With AQE, the initial plan shown by EXPLAIN may differ from the executed plan; the Spark UI reflects dynamic changes (e.g., SMJ converted to BHJ at runtime).
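
For example, here is a minimal PySpark sketch that prints the planned physical operators. The table and column names (orders, customers, customer_id) are placeholders, not anything from the original question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables used only for illustration; substitute your own.
orders = spark.table("orders")          # larger fact table
customers = spark.table("customers")    # smaller dimension table

joined = orders.join(customers, "customer_id")

# Prints the physical plan; look for BroadcastHashJoin, ShuffledHashJoin,
# or SortMergeJoin among the operator names.
joined.explain(mode="formatted")

# SQL equivalent of the same check.
spark.sql(
    "EXPLAIN FORMATTED "
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
).show(truncate=False)
```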

How to tell from the query plan

  • Use SQL EXPLAIN or DataFrame.explain to inspect the physical plan before execution; scan for join nodes:
    • BroadcastHashJoin → broadcast hash join.
    • ShuffledHashJoin → shuffle hash join.
    • SortMergeJoin → sort-merge join.
  • EXPLAIN always shows the initial plan and does not reflect AQE re-optimizations; compare EXPLAIN output with the Spark UI to see if AQE changed the join at runtime (a programmatic alternative is sketched after this list).
  • In Databricks, AQE can dynamically change a planned sort-merge join into a broadcast hash join if a join side is under the adaptive broadcast threshold (default 30MB). Look for different join nodes between initial and current/final plans to confirm the change.
  • If using Photon, you may see Photon-specific operators (e.g., PhotonBroadcastHashJoin); this indicates Photon executed that part of the plan.
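
If you want to confirm the executed strategy without opening the UI, one rough approach (reusing the hypothetical DataFrames from the sketch above) is to trigger the query and then call explain on the same DataFrame again; once it has run, the AdaptiveSparkPlan node should report isFinalPlan=true and show the join that actually executed:

```python
# Reuses the hypothetical orders/customers DataFrames from the earlier sketch.
joined = orders.join(customers, "customer_id")

joined.explain()   # before execution: AdaptiveSparkPlan isFinalPlan=false (initial plan)

# Trigger execution. collect() brings the result to the driver, so only do
# this when the joined result is small enough for that to be safe.
joined.collect()

joined.explain()   # same DataFrame: isFinalPlan=true; the join node shown here
                   # is the strategy AQE actually executed (e.g., SMJ replaced by BHJ)
```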

How to tell from the Spark UI

  • Open the SQL tab → select the query → view the DAG/plan diagram. Join operators are labeled directly:
    • BroadcastHashJoin for BHJ.
    • ShuffledHashJoin for SHJ.
    • SortMergeJoin for SMJ.
  • Hover or expand join nodes to see metrics:
    • Rows output can reveal "row explosion" (unexpectedly high output cardinality).
    • Shuffle read/write shows how much data moved for SHJ/SMJ.
    • Broadcast size appears for BHJ stages and helps confirm broadcast happened.
  • With AQE, the plan diagram can evolve during execution; the Spark UI shows the current/final executed plan, not the initial plan. Use it to verify runtime strategy changes (e.g., SMJ → BHJ) and optimizations like partition coalescing or skew handling via CustomShuffleReader annotations (coalesced/skewed); in Spark 3.2+ this operator appears as AQEShuffleRead.

Notes about AQE (Adaptive Query Execution)

  • AQE may switch sort-merge join to broadcast hash join at runtime based on accurate post-shuffle statistics; the threshold for the dynamic switch is spark.databricks.adaptive.autoBroadcastJoinThreshold (default 30MB). The relevant settings are sketched after this list.
  • EXPLAIN does not execute the query, so it shows the initial plan only; the Spark UI shows the plan as it evolves and the final executed plan, making it the authoritative source for what actually ran under AQE.
  • AQE also handles skew in SMJ/SHJ by splitting skewed partitions; you'll see indicators like SortMergeJoin annotated with skew=true and CustomShuffleReader marked skewed in the plan/UI.
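
As a quick reference, here is a hedged sketch for reading those settings from a notebook; the defaults noted in the comments are the usual ones, and the spark.databricks.* key exists only on Databricks runtimes, so verify against your own environment:

```python
# Read the AQE/broadcast settings mentioned above (defaults in comments are
# the usual ones; verify on your runtime).
print(spark.conf.get("spark.sql.adaptive.enabled"))                # AQE on/off
print(spark.conf.get("spark.sql.adaptive.skewJoin.enabled"))       # skew-join splitting
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))      # static broadcast threshold (10MB default)

# Databricks-only key: AQE's dynamic broadcast threshold (30MB default).
print(spark.conf.get("spark.databricks.adaptive.autoBroadcastJoinThreshold"))
```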

Forcing or controlling join types (when needed)

  • Use join hints to request a strategy (a PySpark/SQL sketch follows this list):
    • BROADCAST(table) → broadcast hash join.
    • MERGE(table) or SHUFFLE_MERGE(table) → sort-merge join.
    • SHUFFLE_HASH(table) → shuffle hash join.
    • Spark prioritizes hints: BROADCAST over MERGE over SHUFFLE_HASH over SHUFFLE_REPLICATE_NL; not all strategies support all join types.
  • Key configs:
    • spark.sql.autoBroadcastJoinThreshold controls static broadcast planning (default 10MB; raise it to broadcast larger tables, or set it to -1 to disable automatic broadcasting).
    • spark.databricks.adaptive.autoBroadcastJoinThreshold controls AQEโ€™s dynamic switch to BHJ at runtime (default 30MB).
    • spark.sql.join.preferSortMergeJoin (true by default) can be set to false to prefer SHJ where feasible; Photon similarly tends to favor SHJ to speed up queries.
  • Even with AQE enabled, broadcast hints can still outperform a dynamic conversion because AQE may only decide to broadcast after both sides shuffle; hints avoid that shuffle upfront if you know a side is small.
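
A hedged sketch of the hint and config mechanics in PySpark, again using the hypothetical orders/customers tables from the earlier examples (the values set at the end are illustrations, not recommendations):

```python
from pyspark.sql.functions import broadcast

# DataFrame API: request a strategy per join (orders/customers as in the earlier sketch).
bhj = orders.join(broadcast(customers), "customer_id")             # broadcast hash join
shj = orders.join(customers.hint("SHUFFLE_HASH"), "customer_id")   # shuffle hash join
smj = orders.join(customers.hint("MERGE"), "customer_id")          # sort-merge join

# SQL hint syntax.
spark.sql("""
    SELECT /*+ BROADCAST(c) */ o.*, c.*
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
""").explain()

# Config knobs (example values only).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # e.g., 50MB static threshold
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")                  # allow SHJ where feasible
```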

Practical checklist

  • Before running:
    • EXPLAIN your query; confirm the planned join node names match expectations (BHJ/SHJ/SMJ).
  • After running:
    • Spark UI → SQL tab → check the join node label and metrics to see what actually executed and whether AQE changed it.
  • If the executed plan isnโ€™t the one you want:
    • Consider adding a join hint or adjusting configs (autoBroadcastJoinThreshold, preferSortMergeJoin) and rerun; verify again in EXPLAIN and the Spark UI.

Useful references

  • Adaptive Query Execution user guide (plans, Spark UI behavior, configs, dynamic BHJ conversion).
  • Join hints syntax and priority (BROADCAST/MERGE/SHUFFLE_HASH/SHUFFLE_REPLICATE_NL).
  • Best practices for choosing BHJ vs SMJ vs SHJ and reading Spark UI join metrics.
  • AQE blog posts (identifying strategy changes and CustomShuffleReader coalesce/skew indicators).

Hope this helps, Louis.

 
