Get Started Discussions

Creating a hierarchy without recursive statements

bmhardy
New Contributor III

I am looking to build a hierarchy from a parent-child relationship table, which I would typically do with a recursive statement in SQL Server / Azure SQL. That means setting an anchor, most commonly the top record of the tree, and then joining back to itself as it recurses through the relationships to build the hierarchical structure. The problem I have encountered with Databricks is that it doesn't support any type of recursive statement, including recursive CTEs and while loops.

The table structure is:

RowId | ParentId | ChildId

ParentId references the RowId of the parent record. I have tried SQL and I have also tried PySpark, but I am out of ideas. This is a really important part of using Databricks for us, as we need to return near real-time aggregate data. Any suggestions would be greatly appreciated.
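To illustrate, this is roughly the pattern I am trying to port (a sketch only, using a placeholder table name HierarchyTable and the columns above); on Databricks the recursive reference inside the CTE is rejected:

# Ported from SQL Server; fails on Databricks because the CTE references itself
spark.sql("""
WITH Hierarchy AS (
    -- Anchor: the top record of the tree
    SELECT RowId, ParentId, ChildId, 0 AS Level
    FROM HierarchyTable
    WHERE ParentId IS NULL

    UNION ALL

    -- Recursive member: join children back to the previous level
    SELECT t.RowId, t.ParentId, t.ChildId, h.Level + 1
    FROM HierarchyTable t
    INNER JOIN Hierarchy h ON t.ParentId = h.RowId
)
SELECT * FROM Hierarchy
""")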

4 REPLIES

NandiniN
Databricks Employee
Databricks Employee

Use a loop to iteratively join the DataFrame with itself to build the hierarchy. Each iteration will add a new level to the hierarchy.
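For example, a rough sketch along these lines (assuming your parent/child table is available as HierarchyTable and that root rows have a NULL ParentId):

from pyspark.sql import functions as F

edges = spark.table("HierarchyTable")   # columns: RowId, ParentId, ChildId

# Level 0: the root rows (no parent)
current = edges.filter(F.col("ParentId").isNull()).withColumn("Level", F.lit(0))
result = current

level = 0
while current.count() > 0:
    level += 1
    # Join the previous level back to the table to find its children
    current = (edges.alias("t")
               .join(current.alias("p"), F.col("t.ParentId") == F.col("p.RowId"))
               .select("t.RowId", "t.ParentId", "t.ChildId")
               .withColumn("Level", F.lit(level)))
    result = result.union(current)

# result now holds every row tagged with its depth in the hierarchy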

We also have CTE support https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-cte.html
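The linked (non-recursive) CTE syntax works as expected, e.g. (again assuming a HierarchyTable view):

spark.sql("""
WITH children AS (
    SELECT ParentId, COUNT(*) AS child_count
    FROM HierarchyTable
    GROUP BY ParentId
)
SELECT * FROM children
""").show()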

szymon_dybczak
Esteemed Contributor III

Hi @bmhardy ,

As @NandiniN suggested, you can use an iterative approach to simulate recursion. I think recursive CTE support is on the Databricks roadmap.

# Sample parent/child data matching the table structure in the question
sample_data = [
    (1, None, 2),
    (2, 1, 3),
    (3, 2, 4),
    (4, 3, None)
]

columns = ["RowId", "ParentId", "ChildId"]
df = spark.createDataFrame(sample_data, columns)

df.createOrReplaceTempView("HierarchyTable")

# Initialize parameters
i = 1
start_row_id = 1  # The starting RowId (replace with your root node)

# Recursive anchor: the root of the tree at level 0
df = spark.sql(f"""
SELECT RowId
     , ParentId
     , ChildId
     , 0 AS RecursionLevel
FROM HierarchyTable
WHERE RowId = {start_row_id}
""")

# Register the initial result as a temporary view for the iterative step
df.createOrReplaceTempView('recursion_df')

# Use a loop to simulate recursion; assumes the data is a tree (a cycle would loop forever)
while True:
    # Join the previous level back to the table to find its children
    bill_df = spark.sql(f"""
    SELECT t.RowId
         , t.ParentId
         , t.ChildId
         , {i} AS RecursionLevel
    FROM recursion_df cte
    INNER JOIN HierarchyTable t ON t.ParentId = cte.RowId
    """)

    # The newly found level becomes the input for the next iteration
    bill_df.createOrReplaceTempView('recursion_df')

    # Accumulate all levels found so far
    df = df.union(bill_df)

    # Stop once an iteration produces no new rows
    if bill_df.count() == 0:
        df.createOrReplaceTempView("final_df")
        break
    else:
        i += 1  # Increment recursion level

display(df)
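Because the loop also registers the accumulated result as the temporary view final_df, you can query it with plain SQL once it finishes, e.g.:

spark.sql("SELECT * FROM final_df ORDER BY RecursionLevel, RowId").show()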

bmhardy
New Contributor III

Unfortunately I get the same result as with the other methods: it only returns the first level and nothing beyond that. It is probably worth mentioning that I am using Delta Live Tables.


bmhardy
New Contributor III

Thank you, I will give this a try. I'll let you know how it goes.
