01-21-2025 11:38 AM
I am looking to build a hierarchy from a parent child relationship table, which I would typically use a recursive statement for in SQL Server / Azure SQL. This would mean setting an anchor, most commonly the top record of the tree, and then join back to itself as it recurses through the relationships to build the hierarchichal structure. The problem I have encountered with Databricks is that it doesn't support ANY types of recursive statement, including CTE's and while loops.
The table structure is:
RowId | ParentId | ChildId
The parent Id references the row Id of the parent record. I have tried using SQL and also tried with PySpark yet I am out of ideas. This is a really important part of using Databricks as we need to return near real-time aggregate data. Any suggestions would be greatly appreciated.
02-01-2025 12:00 AM
Use a loop to iteratively join the DataFrame with itself to build the hierarchy. Each iteration will add a new level to the hierarchy.
We also have CTE support https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-cte.html
02-01-2025 12:02 PM
Hi @bmhardy ,
As @NandiniN suggested you can use iterative approach to simulate recursion. I think in the databricks roadmap there's a plan to support recursive cte.
sample_data = [
(1, None, 2),
(2, 1, 3),
(3, 2, 4),
(4, 3, None)
]
columns = ["RowId", "ParentId", "ChildId"]
df = spark.createDataFrame(sample_data, columns)
df.createOrReplaceTempView("HierarchyTable")
# Initialize parameters
i = 1
start_row_id = 1 # The starting RowId (replace with your root node)
# Recursive anchor
df = spark.sql(f"""
SELECT RowId
, ParentId
, ChildId
, 0 AS RecursionLevel
FROM HierarchyTable
WHERE RowId = {start_row_id}
""")
# Register the initial result as a temporary view for recursion
df.createOrReplaceTempView('recursion_df')
# Use loop to simulate recursion
while True:
bill_df = spark.sql(f"""
SELECT t.RowId
, t.ParentId
, t.ChildId
, {i} AS RecursionLevel
FROM recursion_df cte
INNER JOIN HierarchyTable t ON t.ParentId = cte.RowId
""")
bill_df.createOrReplaceTempView('recursion_df')
df = df.union(bill_df)
# Check if the recursion has produced any results
if bill_df.count() == 0:
df.createOrReplaceTempView("final_df")
break
else:
i += 1 # Increment recursion level
display(df)
02-18-2025 07:43 AM
Unfortunately I get the same result as the other methods. It only returns the first level and nothing beyond that. It is probably worth mentioning that I am using Delta Live Tables.
02-03-2025 05:00 AM
Thank you, I will give this a try. I'll let you know how it goes.
Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!
Sign Up Now