12-20-2022 09:54 AM
I have a table that looks like this:
/* input */
-- | parent | child |
-- | ------ | ----- |
-- |      1 |     2 |
-- |      2 |     3 |
-- |      3 |     4 |
-- |      5 |     6 |
-- |      6 |     7 |
-- |      8 |     9 |
-- |     10 |    11 |
and I want to create something that looks like this:
/* output */
-- | person | first_ancestor |
-- | ------ | -------------- |
-- |      1 |              1 |
-- |      2 |              1 |
-- |      3 |              1 |
-- |      4 |              1 |
-- |      5 |              5 |
-- |      6 |              5 |
-- |      7 |              5 |
-- |      8 |              8 |
-- |      9 |              8 |
-- |     10 |             10 |
-- |     11 |             10 |
If I were using BigQuery, this would be simple with a recursive CTE:
-- recursive CTE: get the first ancestor of each record
with recursive table_data as (
  -- populate data
  select 1 as parent, 2 as child union all
  select 2, 3 union all
  select 3, 4 union all
  select 5, 6 union all
  select 6, 7 union all
  select 8, 9 union all
  select 10, 11
),
base_records as (
  -- base case: all records that are not children of another record
  -- also has the effect of filtering out cyclic groups (e.g. 1->2->1)
  select parent as first_ancestor, parent as person
  from table_data
  where parent not in (select child from table_data)
),
lookup_table as (
  -- start with the base case: the first ancestors themselves
  select person, first_ancestor from base_records
  union all
  -- recursively add descendants, carrying the first ancestor down to each one
  select table_data.child as person, lookup_table.first_ancestor
  from lookup_table
  join table_data
    on lookup_table.person = table_data.parent
)
select person, first_ancestor from lookup_table
order by first_ancestor, person
Is there 1) some function that supports recursion in Databricks (Spark SQL or PySpark), or 2) a canonical way to do this?
I can brute force this by coding a join of arbitrary fixed length (join table_data t1 to table_data t2 to table_data t3 ...), sketched below, but I was wondering if there's a more efficient/dynamic way. I know nothing about graph databases or whether the proper solution lies in them.
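For concreteness, here is a sketch of the brute-force version I mean, with the depth fixed at 3 hops (enough for the sample data above; deeper chains would need more self-joins):

-- brute-force sketch: follow parent links a fixed number of hops (3 here);
-- the left joins stop matching once a chain runs out, so coalesce picks up
-- the highest ancestor actually reached
select
  t1.child as person,
  coalesce(t3.parent, t2.parent, t1.parent) as first_ancestor
from table_data t1
left join table_data t2 on t2.child = t1.parent
left join table_data t3 on t3.child = t2.parent
union all
-- the first ancestors map to themselves (records that are never children)
select parent, parent
from table_data
where parent not in (select child from table_data)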
Thank you so much 🙂
12-20-2022 12:07 PM
Hi @Jonathan Dufault ,
Recursive CTEs aren't currently supported on Apache Spark (or Databricks). There are some workarounds that might help, such as this one by my colleague https://medium.com/@24chynoweth/recursive-cte-on-databricks-2ac0dff8ca06 or this slightly newer post https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9.
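The usual shape of such workarounds is to replace the recursive CTE with a driver-side loop that keeps joining until no new rows appear. Here's a minimal PySpark sketch of that idea (illustrative only, not the posts' exact code), using the sample data from your question and assuming the spark session that Databricks notebooks provide:

# loop-based workaround: emulate WITH RECURSIVE with a driver-side loop
# that joins until a fixpoint (no new descendants) is reached
from pyspark.sql import functions as F

table_data = spark.createDataFrame(
    [(1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (8, 9), (10, 11)],
    ["parent", "child"],
)

# base case: parents that never appear as a child
roots = table_data.select("parent").exceptAll(table_data.select("child")).distinct()
lookup = roots.select(F.col("parent").alias("person"),
                      F.col("parent").alias("first_ancestor"))

frontier = lookup
while True:
    # recursive step: attach the children of the current frontier
    step = (frontier.join(table_data, frontier.person == table_data.parent)
                    .select(F.col("child").alias("person"), "first_ancestor"))
    if step.count() == 0:   # fixpoint: no new descendants found
        break
    lookup = lookup.union(step)
    frontier = step

lookup.orderBy("first_ancestor", "person").show()

Each iteration extends the query plan, so for deep hierarchies you'd likely want to cache or checkpoint the frontier, or cap the iteration count as a guard against cycles.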
Hopefully this helps, - LG
12-21-2022 09:59 AM
Wow, this completely answers my question. Do you have a sense of whether this is going to be implemented at some point, or how important it is? Just curious, mainly.
12-21-2022 11:00 AM
Hi @Jonathan Dufault ,
I'm glad it answered your question. From what I can see internally there is a decent amount of demand for this feature so I have a feeling it will be implemented in the future, but I don't have any timelines at this point.
It also appears that OSS Spark has two pull requests that would implement Recursive CTEs, which shows promise for this feature.
01-22-2023 11:03 AM
@Landan George
Hey, I am looking into the same issue, but when I execute what's suggested in the post for CTE_Recursive https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9
I get this error:
Error in SQL statement: AnalysisException: Table or view not found: CTE_Recursive; line x pos y;
I created the same table as in the post's example and even copied and executed the code verbatim, so I don't think I'm doing anything differently.
Can you please advise?
01-24-2023 07:26 AM
I selected the above answer because it answered the "oh, this function doesn't exist" part. I didn't try the workaround. @Landan George, just letting you know there's an open question above.