
Creating a hierarchy without recursive statements

bmhardy
New Contributor III

I am looking to build a hierarchy from a parent-child relationship table, which I would typically use a recursive statement for in SQL Server / Azure SQL. That means setting an anchor, most commonly the top record of the tree, and then joining the table back to itself as it recurses through the relationships to build the hierarchical structure. The problem I have encountered with Databricks is that it doesn't support any type of recursive statement, including CTEs and while loops.

The table structure is:

RowId | ParentId | ChildId

The ParentId references the RowId of the parent record. I have tried SQL and also PySpark, yet I am out of ideas. This is a really important part of our use of Databricks, as we need to return near real-time aggregate data. Any suggestions would be greatly appreciated.
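
To illustrate (the table name below is just a placeholder), a single self-join in Spark SQL gets me one level down from the anchor, but I cannot see how to repeat the join to an arbitrary, unknown depth without recursion:

# Illustrative only: 'hierarchy' is a placeholder table name; columns match the structure above
one_level = spark.sql("""
    SELECT c.RowId, c.ParentId, c.ChildId
    FROM hierarchy p
    INNER JOIN hierarchy c ON c.ParentId = p.RowId
    WHERE p.ParentId IS NULL  -- anchor: the top record of the tree
""")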

3 REPLIES

NandiniN
Databricks Employee

Use a loop to iteratively join the DataFrame with itself to build the hierarchy. Each iteration will add a new level to the hierarchy.
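
A rough sketch of that loop using the DataFrame API (the table name "hierarchy" is a placeholder, and the anchor assumes the root rows have a NULL ParentId; adjust both to your data):

from pyspark.sql import functions as F

hierarchy_df = spark.table("hierarchy")  # RowId | ParentId | ChildId

# Anchor: the top of the tree (rows with no parent)
current = hierarchy_df.filter(F.col("ParentId").isNull()).withColumn("Level", F.lit(0))
result = current

level = 0
while current.count() > 0:  # stop once a level returns no rows
    level += 1
    # Join the previous level back to the table to find its children
    current = (
        hierarchy_df.alias("t")
        .join(current.alias("c"), F.col("t.ParentId") == F.col("c.RowId"))
        .select("t.RowId", "t.ParentId", "t.ChildId")
        .withColumn("Level", F.lit(level))
    )
    # Accumulate each level into the overall result
    result = result.union(current)

display(result)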

We also have CTE support https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-cte.html
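
For reference, this is the kind of (non-recursive) CTE that is supported; the CTE just cannot reference itself. Assuming the table from the question is registered as HierarchyTable:

# A plain WITH ... AS works; the CTE simply can't refer to its own name
spark.sql("""
    WITH roots AS (
        SELECT RowId, ChildId
        FROM HierarchyTable
        WHERE ParentId IS NULL
    )
    SELECT * FROM roots
""").show()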

szymon_dybczak
Esteemed Contributor III

Hi @bmhardy ,

As @NandiniN suggested, you can use an iterative approach to simulate recursion. I believe recursive CTE support is on the Databricks roadmap. Here's an example using your table structure with some sample data:

sample_data = [
    (1, None, 2),
    (2, 1, 3),
    (3, 2, 4),
    (4, 3, None)
]

columns = ["RowId", "ParentId", "ChildId"]
df = spark.createDataFrame(sample_data, columns)


df.createOrReplaceTempView("HierarchyTable")

# Initialize parameters
i = 1
start_row_id = 1  # The starting RowId (replace with your root node)

# Recursive anchor
df = spark.sql(f"""
SELECT RowId
     , ParentId
     , ChildId
     , 0 AS RecursionLevel
FROM HierarchyTable
WHERE RowId = {start_row_id}
""")

# Register the initial result as a temporary view for recursion
df.createOrReplaceTempView('recursion_df')

# Use a loop to simulate recursion (note: this assumes the hierarchy contains no cycles,
# otherwise the loop would never terminate)
while True:

    # Join the previous level back to the hierarchy table to find its children
    bill_df = spark.sql(f"""
    SELECT t.RowId
         , t.ParentId
         , t.ChildId
         , {i} AS RecursionLevel
    FROM recursion_df cte
    INNER JOIN HierarchyTable t ON t.ParentId = cte.RowId
    """)

    # The newly found level becomes the input for the next iteration
    bill_df.createOrReplaceTempView('recursion_df')

    # Accumulate the level into the overall result
    df = df.union(bill_df)

    # Check if the recursion has produced any new rows
    if bill_df.count() == 0:
        df.createOrReplaceTempView("final_df")
        break
    else:
        i += 1  # Increment recursion level


display(df)
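
Once the loop finishes, final_df is a flat view of the whole tree with a RecursionLevel column, so the near real-time aggregates you mentioned can run straight against it, for example:

# Example aggregate over the flattened hierarchy (replace COUNT(*) with your real measure)
spark.sql("""
    SELECT RecursionLevel, COUNT(*) AS nodes
    FROM final_df
    GROUP BY RecursionLevel
    ORDER BY RecursionLevel
""").show()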

bmhardy
New Contributor III

Thank you, I will give this a try. I'll let you know how it goes.
