cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Hierarchy roll up aggregation

bmhardy
New Contributor II

I have just learnt that recursive CTE's are not supported in Databricks SQL, however, we are looking to shift the complex aggregations into Databricks instead of relying on Azure SQL DB.

We are using Azure SQL DB with CDC enabled in combination with ADF to populate the DLT's in Databricks. From here we are trying to aggregate in two separate dimensions, which are by hierarchy (using a parent child table for reference) and then by period (day, week, month, year etc.). Our data engineering team have a stronger background in SQL and more of a DBA related past, so Python and Spark are newer conepts for us.

From my understanding it appears that PySpark may be the only option to achieve the rolling up of figures through a hierarchy. Is this correct? If so, does anyone know of any supporting material to be able to achieve this?

Primarily we will use two source tables, which are a team hierarchy with parent and child relationships then a transactional table with a team Id, date, and the values we are looking to aggregate up through the hierarchy.

3 REPLIES 3

ameet9257
New Contributor II

It's true. One month back I was also trying to solve a similar hierarchy problem and realized that recursive CTE is not supported in Databricks SQL. Now the only option is to go withPySpark

If you can describe your problem statement more, I will help you with the solution. 

Thanks.

Amit Prajapati

bmhardy
New Contributor II

Thank you Amit.

We use a 5 layer process, instead of the traditional medallion bronze, silver, gold. The materialised views are split into two functions instead of just gold, where we have 'calculated' and 'aggregated'. Once we have performed any calculations we need we than pass it over to the aggregated views to roll up a company level hierarchy. We were hoping to use a recursive CTE as a template to roll everything up in the same way each time, like the query below.

Our company hierarchy table (screenshot attached) contains the parent child relationship but the parent Id is the row Id of the parent not the Id of the parent team itself. In the screenshot all the child teams are related to the top row.

-----------------------------------------------------------------	
-- PREPARE SUBSET OF DATA
-----------------------------------------------------------------
	DECLARE
		@StartDate		int = 20240715,
		@EndDate		int = 20240721;

	DECLARE
		@Workload	TABLE(
		[TeamId]		int,
		[DateId]		int,
		[Duration]		int
	)
	
	INSERT INTO
		@Workload
	SELECT
		[w].[WorkgroupId],
		[w].[DateId],
		[w].[Duration]
	FROM
		[Workload] w
	WHERE
		[w].[Verified] = 1
		AND [w].[DateId] BETWEEN @StartDate AND @EndDate;

-----------------------------------------------------------------	
-- BUILD HIERARCHY
-----------------------------------------------------------------

	WITH [Hierarchy] (
		[TopRowId],
		[RowId],
		[TeamId],
		[TopWorkgroupId])
	AS (
		SELECT
			[ch].[Id],
			[ch].[Id],
			[ch].[TeamId],
			[ch].[TeamId]
		FROM
			[CompanyHierarchy] ch
		WHERE
			[ch].[EndDate] IS NULL
			AND ([ch].[RootNodeId] = 1 OR [ch].[TeamId] = 1)

		UNION ALL
	
		SELECT
			[h].[TopRowId],
			[wh].[Id],
			[wh].[WorkgroupId],
			[tr].[WorkgroupId]
		FROM
			[Hierarchy] h
			JOIN [CompanyHierarchy] ch
				ON [h].[RowId] = [ch].[ParentId]
			JOIN [CompanyHierarchy] tr
				ON [h].[TopRowId] = [tr].[Id]
		WHERE
			[ch].[EndDate] IS NULL
		)

-----------------------------------------------------------------	
-- JOIN DATA TO HIERARCHY AND ROLL UP
-----------------------------------------------------------------
	
	SELECT
		[h].[TopWorkgroupId],
		[t].[Name] AS [Team],
		[d].[DateId],
		SUM(CAST(ISNULL([d].[Duration] / 60.0,0.0) AS decimal(10,2))) AS [Amount accumulated]
	FROM
		[Hierarchy] h
		JOIN @Workload w
			ON [h].[WorkgroupId] = [w].[TeamId]
		JOIN [Teams] t
			ON [h].[TopWorkgroupId] = [t].[Id]
	GROUP BY
		[h].[TopWorkgroupId],
		[t].[Name],
		[w].[DateId]
	ORDER BY
		[t].[Name],
		[w].[DateId]

-----------------------------------------------------------------

-----------------------------------------------------------------

 

bmhardy
New Contributor II

We have gone down a different route where we are using SQL for our calculated layer and then a Python notebook for our aggregated layer. It is much easier to roll up data using a Python UDF than it is trying to work out how to do it in SQL.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group