cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How can I look up the first ancestor (person,first_ancestor) of a record from a table that has (child,parent) records?

jonathan-dufaul
Valued Contributor

I have a table that looks like this:

/* input */
-- | parent | child |
-- | ------ | ----- |
-- | 1      | 2     |
-- | 2      | 3     |
-- | 3      | 4     |
-- | 5      | 6     |
-- | 6      | 7     |
-- | 8      | 9     |
-- | 10     | 11    |

and I want create something that looks like this:

/* output */
-- | person | first_ancestor |
-- | ------ | --------------- |
-- | 1      | 1               |
-- | 2      | 1               |
-- | 3      | 1               |
-- | 4      | 1               |
-- | 5      | 5               |
-- | 6      | 5               |
-- | 7      | 5               |
-- | 8      | 8               |
-- | 9      | 8               |
-- | 10     | 10              |
-- | 11     | 10              |

If I were using bigquery, it would be simple using a recursive CTE

-- recursive CTE: get the first ancestor of each record
with recursive table_data as (
  -- populate data
  select 1 as parent, 2 as child union all
  select 2,3 union all 
  select 3,4 union all
  select 5,6 union all
  select 6,7 union all
  select 8,9 union all
  select 10,11
), 
base_records as (
    -- make the base case (get all records that are not children of another record)
    -- also has the effect of filtering out cyclic groups (e.g. 1->2->->1)
    select parent as first_ancestor,parent as person 
    from table_data 
    where parent not in (select child from table_data)
    ),
lookup_table as (
  -- first start with the base case...the first ancestor
  select person,first_ancestor from base_records
  union all 
  -- recursively add descendants, noting the first ancestor to that descendant
  select table_data.child as person,lookup_table.first_ancestor
    from lookup_table 
    join table_data 
    on lookup_table.person = table_data.parent
)
select  person,first_ancestor from lookup_table
order by first_ancestor,person

is there 1) some function that has recursion in databricks (spark sql or pyspark), or 2) a canonical way to do this?

I can brute force this by coding an arbitrary length join (join table_data t1 to table_data t2 to table_data t3 ...) but was wondering if there's a less inefficient/more dynamic way. I know nothing about graph databases or if the proper solution lies in them.

Thank you so much 🙂

1 ACCEPTED SOLUTION

Accepted Solutions

LandanG
Honored Contributor
Honored Contributor

Hi @Jonathan Dufault​ ,

Around Recursive CTEs, at the moment they aren't really supported on Apache Spark (and Databricks). There are some workarounds that might help such as this one by my colleague https://medium.com/@24chynoweth/recursive-cte-on-databricks-2ac0dff8ca06 or this slightly newer post https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9.

Hopefully this helps, - LG

View solution in original post

5 REPLIES 5

LandanG
Honored Contributor
Honored Contributor

Hi @Jonathan Dufault​ ,

Around Recursive CTEs, at the moment they aren't really supported on Apache Spark (and Databricks). There are some workarounds that might help such as this one by my colleague https://medium.com/@24chynoweth/recursive-cte-on-databricks-2ac0dff8ca06 or this slightly newer post https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9.

Hopefully this helps, - LG

wow this completely answers my question. do you have a sense for whether that's a thing that's going to be implemented at some point/how important or not it is? just curious mainly.

Hi @Jonathan Dufault​ ,

I'm glad it answered your question. From what I can see internally there is a decent amount of demand for this feature so I have a feeling it will be implemented in the future, but I don't have any timelines at this point.

It also appears that OSS Spark has two pull requests that would implement Recursive CTEs which shows promise for this feature

JGil
New Contributor III

@Landan George​ 

Hey, I am looking into same issue, but when I execute what's suggested in the post for CTE_Recursive https://medium.com/globant/how-to-implement-recursive-queries-in-spark-3d26f7ed3bc9

I get error

Error in SQL statement: AnalysisException: Table or view not found: CTE_Recursive; line x pos y;

I created the same table as in the post's example and even copied the code and executed it so I am not doing something else.

Can you please advise ?

I selected the above answer because it answered the "oh this function doesn't exist." I didn't try the workaround. @Landan George​ just letting you know this is a question above

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!