- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-05-2024 10:21 AM
Hi All,
I am wondering how you would go about translating either of the below to Spark SQL in Databricks. They are more or less equivalent statements in T-SQL.
Please note that I am attempting to pair each unique Policy (IPI_ID) record with its highest numbered Location (IL_ID) record. There can be many Location records for each Policy record. The Location table links to the Policy table via Policy.IPI_ID = Location.IL_IPI_ID.
I have tried to utilize LIMIT 1 in certain ways (example further below) but either receive errors or the results do not match.
Any help or suggestions are appreciated!
T-SQL:
select
ipi.IPI_ID
,loc.IL_ID
from Policy ipi
outer apply
(
select top 1 il.IL_ID
from Location il
where il.IL_IPI_ID = ipi.IPI_ID
order by
il.IL_ID desc
) loc
--
select
ipi.IPI_ID
,il.IL_ID
from Policy ipi
left join Location il
on il.IL_ID =
(
select top 1 il2.IL_ID
from Location il2
where il2.IL_IPI_ID = ipi.IPI_ID
order by
il2.IL_ID desc
)
Errors out in Databricks Spark SQL:
select
ipi.IPI_ID
,il.IL_ID
from Policy ipi
left join Location il
on il.IL_ID =
(
select il2.IL_ID
from Location il2
where il2.IL_IPI_ID = ipi.IPI_ID
order by
il2.IL_ID desc
limit 1
);
- Labels:
-
Spark
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-05-2024 11:46 AM
Hi @MattHeidebrecht ,
You can use ROW_NUMBER window function to order rows and then filter on that rank.
Here is the query:
WITH loc_ranked AS (
SELECT
IL_IPI_ID,
IL_ID,
ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
FROM Location
)
SELECT
ipi.IPI_ID,
loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
AND loc_ranked.rn = 1;
Here is the end to end example with sample data:
CREATE OR REPLACE TEMP VIEW Policy AS
SELECT * FROM VALUES
(1),
(2),
(3)
AS t(IPI_ID);
CREATE OR REPLACE TEMP VIEW Location AS
SELECT * FROM VALUES
(1, 10),
(1, 5),
(2, 100),
(2, 90),
(3, 300),
(3, 250),
(3, 100)
AS t(IL_IPI_ID, IL_ID);
WITH loc_ranked AS (
SELECT
IL_IPI_ID,
IL_ID,
ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
FROM Location
)
SELECT
ipi.IPI_ID,
loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
AND loc_ranked.rn = 1;
The result:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-05-2024 11:46 AM
Hi @MattHeidebrecht ,
You can use ROW_NUMBER window function to order rows and then filter on that rank.
Here is the query:
WITH loc_ranked AS (
SELECT
IL_IPI_ID,
IL_ID,
ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
FROM Location
)
SELECT
ipi.IPI_ID,
loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
AND loc_ranked.rn = 1;
Here is the end to end example with sample data:
CREATE OR REPLACE TEMP VIEW Policy AS
SELECT * FROM VALUES
(1),
(2),
(3)
AS t(IPI_ID);
CREATE OR REPLACE TEMP VIEW Location AS
SELECT * FROM VALUES
(1, 10),
(1, 5),
(2, 100),
(2, 90),
(3, 300),
(3, 250),
(3, 100)
AS t(IL_IPI_ID, IL_ID);
WITH loc_ranked AS (
SELECT
IL_IPI_ID,
IL_ID,
ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
FROM Location
)
SELECT
ipi.IPI_ID,
loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
AND loc_ranked.rn = 1;
The result:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-06-2024 07:22 AM
Thanks filipniziol! I'll start running with that when I run into cases where I need an embedded TOP 1.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-06-2024 08:18 AM
Hi @MattHeidebrecht ,
Great! If this resolves your question, please consider marking it as the solution. It helps others in the community find answers more easily. 😊

