cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Which table should i use for a range join hint?

Travis84
New Contributor II

I am a bit confused about how to use range join hints. Consider the following query

 

 
```
SELECT
  p.id,
  p.ts,
  p.value,
  rg.metric1,
  rg.metric2,
  rg.ts AS range_ts
FROM points p
LEFT JOIN LATERAL (
  SELECT r.metric1, r.metric2, r.ts
  FROM ranges r
  WHERE r.point_id = p.id
    AND r.ts BETWEEN p.ts - INTERVAL 1 MINUTE AND p.ts
  ORDER BY r.ts DESC
  LIMIT 1
) rg
```
should the hint be applies on points or on ranges?
does the hint go immediately after the first select statement?
 
In the documentation (Range join optimization | Databricks on AWS) the first example shows the hint applying to the points table, the 3rd example shows the hint applying to the ranges like table.
1 REPLY 1

K_Anudeep
Databricks Employee
Databricks Employee

Hello @Travis84 ,

Below are the answers to your questions:

Where to put the hint? On either one of the two relations that participate in the range join for that specific join block. In simple two-table queries, it doesnโ€™t matter. In multi-join queries, you must attach it to the relation involved in the range join at that stage (SQL is left-associative, so (a JOIN b) JOIN c: a hint on a affects a JOIN b, not the later join to c).

Where in the SQL does it go? As a SELECT hint, immediately after SELECT, e.g. SELECT /*+ RANGE_JOIN(points, 60) */ โ€ฆ

In the SQL above, your condition is r.ts BETWEEN p.ts - INTERVAL 1 MINUTE AND p.ts.
For the range-join optimisation on a LEFT join, Databricks requires the point value to be on the left side of the join. You can rewrite the predicate equivalently as:
p.ts BETWEEN r.ts AND r.ts + INTERVAL 1 MINUTE and then apply the hint.

Example:

SELECT /*+ RANGE_JOIN(p, 60) */
p.id, p.ts, p.value,
r.metric1, r.metric2, r.ts AS range_ts
FROM points p
LEFT JOIN ranges r
ON r.point_id = p.id
AND p.ts BETWEEN r.ts AND r.ts + INTERVAL 1 MINUTE
QUALIFY row_number() OVER (PARTITION BY p.id, p.ts ORDER BY r.ts DESC) = 1;

 

Or
 
LEFT JOIN LATERAL (
SELECT /*+ RANGE_JOIN(r, 60) */ r.metric1, r.metric2, r.ts
FROM ranges r
WHERE r.point_id = p.id
AND p.ts BETWEEN r.ts AND r.ts + INTERVAL 1 MINUTE
ORDER BY r.ts DESC
LIMIT 1
) rg

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local communityโ€”sign up today to get started!

Sign Up Now