
Creating a hierarchy without recursive statements

bmhardy
New Contributor III

I am looking to build a hierarchy from a parent-child relationship table, which I would typically use a recursive statement for in SQL Server / Azure SQL. That means setting an anchor, most commonly the top record of the tree, and then joining the table back to itself as it recurses through the relationships to build the hierarchical structure. The problem I have encountered with Databricks is that it doesn't support any type of recursive statement, including CTEs and while loops.

The table structure is:

RowId | ParentId | ChildId

The ParentId references the RowId of the parent record. I have tried SQL and also PySpark, yet I am out of ideas. This is a really important part of our use of Databricks, as we need to return near real-time aggregate data. Any suggestions would be greatly appreciated.
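
To illustrate (the table name below is just a placeholder), a single self-join in Spark SQL gets me one level down from the anchor, but I cannot see how to repeat the join to an arbitrary, unknown depth without recursion:

# Illustrative only: 'hierarchy' is a placeholder table name; columns match the structure above
one_level = spark.sql("""
    SELECT c.RowId, c.ParentId, c.ChildId
    FROM hierarchy p
    INNER JOIN hierarchy c ON c.ParentId = p.RowId
    WHERE p.ParentId IS NULL  -- anchor: the top record of the tree
""")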

3 REPLIES

NandiniN
Databricks Employee

Use a loop to iteratively join the DataFrame with itself to build the hierarchy. Each iteration will add a new level to the hierarchy.
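
A rough sketch of that loop using the DataFrame API (the table name "hierarchy" is a placeholder, and the anchor assumes the root rows have a NULL ParentId; adjust both to your data):

from pyspark.sql import functions as F

hierarchy_df = spark.table("hierarchy")  # RowId | ParentId | ChildId

# Anchor: the top of the tree (rows with no parent)
current = hierarchy_df.filter(F.col("ParentId").isNull()).withColumn("Level", F.lit(0))
result = current

level = 0
while current.count() > 0:  # stop once a level returns no rows
    level += 1
    # Join the previous level back to the table to find its children
    current = (
        hierarchy_df.alias("t")
        .join(current.alias("c"), F.col("t.ParentId") == F.col("c.RowId"))
        .select("t.RowId", "t.ParentId", "t.ChildId")
        .withColumn("Level", F.lit(level))
    )
    # Accumulate each level into the overall result
    result = result.union(current)

display(result)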

We also have CTE support https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-cte.html
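
For reference, this is the kind of (non-recursive) CTE that is supported; the CTE just cannot reference itself. Assuming the table from the question is registered as HierarchyTable:

# A plain WITH ... AS works; the CTE simply can't refer to its own name
spark.sql("""
    WITH roots AS (
        SELECT RowId, ChildId
        FROM HierarchyTable
        WHERE ParentId IS NULL
    )
    SELECT * FROM roots
""").show()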

szymon_dybczak
Esteemed Contributor III

Hi @bmhardy ,

As @NandiniN suggested, you can use an iterative approach to simulate recursion. I believe recursive CTE support is on the Databricks roadmap. Here's an example using your table structure with some sample data:

sample_data = [
    (1, None, 2),
    (2, 1, 3),
    (3, 2, 4),
    (4, 3, None)
]

columns = ["RowId", "ParentId", "ChildId"]
df = spark.createDataFrame(sample_data, columns)


df.createOrReplaceTempView("HierarchyTable")

# Initialize parameters
i = 1
start_row_id = 1  # The starting RowId (replace with your root node)

# Recursive anchor
df = spark.sql(f"""
SELECT RowId
     , ParentId
     , ChildId
     , 0 AS RecursionLevel
FROM HierarchyTable
WHERE RowId = {start_row_id}
""")

# Register the initial result as a temporary view for recursion
df.createOrReplaceTempView('recursion_df')

# Use a loop to simulate recursion (note: this assumes the hierarchy contains no cycles,
# otherwise the loop would never terminate)
while True:

    # Join the previous level back to the hierarchy table to find its children
    bill_df = spark.sql(f"""
    SELECT t.RowId
         , t.ParentId
         , t.ChildId
         , {i} AS RecursionLevel
    FROM recursion_df cte
    INNER JOIN HierarchyTable t ON t.ParentId = cte.RowId
    """)

    # The newly found level becomes the input for the next iteration
    bill_df.createOrReplaceTempView('recursion_df')

    # Accumulate the level into the overall result
    df = df.union(bill_df)

    # Check if the recursion has produced any new rows
    if bill_df.count() == 0:
        df.createOrReplaceTempView("final_df")
        break
    else:
        i += 1  # Increment recursion level


display(df)
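
Once the loop finishes, final_df is a flat view of the whole tree with a RecursionLevel column, so the near real-time aggregates you mentioned can run straight against it, for example:

# Example aggregate over the flattened hierarchy (replace COUNT(*) with your real measure)
spark.sql("""
    SELECT RecursionLevel, COUNT(*) AS nodes
    FROM final_df
    GROUP BY RecursionLevel
    ORDER BY RecursionLevel
""").show()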

bmhardy
New Contributor III

Thank you, I will give this a try. I'll let you know how it goes.
