- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-20-2022 09:54 AM
I have a table that looks like this:
/* input */
-- | parent | child |
-- | ------ | ----- |
-- | 1 | 2 |
-- | 2 | 3 |
-- | 3 | 4 |
-- | 5 | 6 |
-- | 6 | 7 |
-- | 8 | 9 |
-- | 10 | 11 |
and I want create something that looks like this:
/* output */
-- | person | first_ancestor |
-- | ------ | --------------- |
-- | 1 | 1 |
-- | 2 | 1 |
-- | 3 | 1 |
-- | 4 | 1 |
-- | 5 | 5 |
-- | 6 | 5 |
-- | 7 | 5 |
-- | 8 | 8 |
-- | 9 | 8 |
-- | 10 | 10 |
-- | 11 | 10 |
If I were using bigquery, it would be simple using a recursive CTE
-- recursive CTE: get the first ancestor of each record
with recursive table_data as (
-- populate data
select 1 as parent, 2 as child union all
select 2,3 union all
select 3,4 union all
select 5,6 union all
select 6,7 union all
select 8,9 union all
select 10,11
),
base_records as (
-- make the base case (get all records that are not children of another record)
-- also has the effect of filtering out cyclic groups (e.g. 1->2->->1)
select parent as first_ancestor,parent as person
from table_data
where parent not in (select child from table_data)
),
lookup_table as (
-- first start with the base case...the first ancestor
select person,first_ancestor from base_records
union all
-- recursively add descendants, noting the first ancestor to that descendant
select table_data.child as person,lookup_table.first_ancestor
from lookup_table
join table_data
on lookup_table.person = table_data.parent
)
select person,first_ancestor from lookup_table
order by first_ancestor,person
is there 1) some function that has recursion in databricks (spark sql or pyspark), or 2) a canonical way to do this?
I can brute force this by coding an arbitrary length join (join table_data t1 to table_data t2 to table_data t3 ...) but was wondering if there's a less inefficient/more dynamic way. I know nothing about graph databases or if the proper solution lies in them.
Thank you so much 🙂
- Labels:
-
Recursive Cte
-
Table Data
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-20-2022 12:07 PM
Hi @Jonathan Dufault ,
Around Recursive CTEs, at the moment they aren't really supported on Apache Spark (and Databricks). There are some workarounds that might help such as this one by my colleague https://medium.com/@24chynoweth/recursive-cte-on-databricks-2ac0dff8ca06 or this slightly newer post https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9.
Hopefully this helps, - LG
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-20-2022 12:07 PM
Hi @Jonathan Dufault ,
Around Recursive CTEs, at the moment they aren't really supported on Apache Spark (and Databricks). There are some workarounds that might help such as this one by my colleague https://medium.com/@24chynoweth/recursive-cte-on-databricks-2ac0dff8ca06 or this slightly newer post https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9.
Hopefully this helps, - LG
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-21-2022 09:59 AM
wow this completely answers my question. do you have a sense for whether that's a thing that's going to be implemented at some point/how important or not it is? just curious mainly.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-21-2022 11:00 AM
Hi @Jonathan Dufault ,
I'm glad it answered your question. From what I can see internally there is a decent amount of demand for this feature so I have a feeling it will be implemented in the future, but I don't have any timelines at this point.
It also appears that OSS Spark has two pull requests that would implement Recursive CTEs which shows promise for this feature
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-22-2023 11:03 AM
@Landan George
Hey, I am looking into same issue, but when I execute what's suggested in the post for CTE_Recursive https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9
I get error
Error in SQL statement: AnalysisException: Table or view not found: CTE_Recursive; line x pos y;
I created the same table as in the post's example and even copied the code and executed it so I am not doing something else.
Can you please advise ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-24-2023 07:26 AM
I selected the above answer because it answered the "oh this function doesn't exist." I didn't try the workaround. @Landan George just letting you know this is a question above