cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Translations from T-SQL: TOP 1 OUTER APPLY or LEFT JOIN

MattHeidebrecht
New Contributor II

Hi All,

I am wondering how you would go about translating either of the below to Spark SQL in Databricks.  They are more or less equivalent statements in T-SQL.

Please note that I am attempting to pair each unique Policy (IPI_ID) record with its highest numbered Location (IL_ID) record.  There can be many Location records for each Policy record.  The Location table links to the Policy table via Policy.IPI_ID = Location.IL_IPI_ID.

I have tried to utilize LIMIT 1 in certain ways (example further below) but either receive errors or the results do not match.

Any help or suggestions are appreciated!

T-SQL:

 

select
	ipi.IPI_ID
	,loc.IL_ID
from Policy ipi
outer apply
	(
	select top 1 il.IL_ID
	from Location il
	where il.IL_IPI_ID = ipi.IPI_ID
	order by
		il.IL_ID desc
	) loc

--

select
	ipi.IPI_ID
	,il.IL_ID
from Policy ipi
left join Location il
	on il.IL_ID =
		(
		select top 1 il2.IL_ID
		from Location il2
		where il2.IL_IPI_ID = ipi.IPI_ID
		order by
			il2.IL_ID desc
		)

 

Errors out in Databricks Spark SQL:

 

select
	ipi.IPI_ID
	,il.IL_ID
from Policy ipi
left join Location il
	on il.IL_ID =
	(
	select il2.IL_ID
	from Location il2
	where il2.IL_IPI_ID = ipi.IPI_ID
	order by
		il2.IL_ID desc
  limit 1
  );

 

1 ACCEPTED SOLUTION

Accepted Solutions

filipniziol
Contributor III

Hi @MattHeidebrecht ,

You can use ROW_NUMBER window function to order rows and then filter on that rank.

Here is the query:

WITH loc_ranked AS (
    SELECT
        IL_IPI_ID,
        IL_ID,
        ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
    FROM Location
)

SELECT
    ipi.IPI_ID,
    loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
    ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
    AND loc_ranked.rn = 1;

 

Here is the end to end example with sample data:

CREATE OR REPLACE TEMP VIEW Policy AS
SELECT * FROM VALUES
  (1),
  (2),
  (3)
AS t(IPI_ID);

CREATE OR REPLACE TEMP VIEW Location AS
SELECT * FROM VALUES
  (1, 10),
  (1, 5),
  (2, 100),
  (2, 90),
  (3, 300),
  (3, 250),
  (3, 100)
AS t(IL_IPI_ID, IL_ID);

WITH loc_ranked AS (
    SELECT
        IL_IPI_ID,
        IL_ID,
        ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
    FROM Location
)
SELECT
    ipi.IPI_ID,
    loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
    ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
    AND loc_ranked.rn = 1;

The result:

filipniziol_1-1733427959985.png

 

View solution in original post

3 REPLIES 3

filipniziol
Contributor III

Hi @MattHeidebrecht ,

You can use ROW_NUMBER window function to order rows and then filter on that rank.

Here is the query:

WITH loc_ranked AS (
    SELECT
        IL_IPI_ID,
        IL_ID,
        ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
    FROM Location
)

SELECT
    ipi.IPI_ID,
    loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
    ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
    AND loc_ranked.rn = 1;

 

Here is the end to end example with sample data:

CREATE OR REPLACE TEMP VIEW Policy AS
SELECT * FROM VALUES
  (1),
  (2),
  (3)
AS t(IPI_ID);

CREATE OR REPLACE TEMP VIEW Location AS
SELECT * FROM VALUES
  (1, 10),
  (1, 5),
  (2, 100),
  (2, 90),
  (3, 300),
  (3, 250),
  (3, 100)
AS t(IL_IPI_ID, IL_ID);

WITH loc_ranked AS (
    SELECT
        IL_IPI_ID,
        IL_ID,
        ROW_NUMBER() OVER (PARTITION BY IL_IPI_ID ORDER BY IL_ID DESC) AS rn
    FROM Location
)
SELECT
    ipi.IPI_ID,
    loc_ranked.IL_ID
FROM Policy ipi
LEFT JOIN loc_ranked
    ON ipi.IPI_ID = loc_ranked.IL_IPI_ID
    AND loc_ranked.rn = 1;

The result:

filipniziol_1-1733427959985.png

 

MattHeidebrecht
New Contributor II

Thanks filipniziol!  I'll start running with that when I run into cases where I need an embedded TOP 1.

Hi @MattHeidebrecht ,

Great! If this resolves your question, please consider marking it as the solution. It helps others in the community find answers more easily. 😊

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group