Databricks Community

bmhardy · ‎11-12-2024

I have just learnt that recursive CTE's are not supported in Databricks SQL, however, we are looking to shift the complex aggregations into Databricks instead of relying on Azure SQL DB.

We are using Azure SQL DB with CDC enabled in combination with ADF to populate the DLT's in Databricks. From here we are trying to aggregate in two separate dimensions, which are by hierarchy (using a parent child table for reference) and then by period (day, week, month, year etc.). Our data engineering team have a stronger background in SQL and more of a DBA related past, so Python and Spark are newer conepts for us.

From my understanding it appears that PySpark may be the only option to achieve the rolling up of figures through a hierarchy. Is this correct? If so, does anyone know of any supporting material to be able to achieve this?

Primarily we will use two source tables, which are a team hierarchy with parent and child relationships then a transactional table with a team Id, date, and the values we are looking to aggregate up through the hierarchy.

ameet9257 · ‎11-12-2024

It's true. One month back I was also trying to solve a similar hierarchy problem and realized that recursive CTE is not supported in Databricks SQL. Now the only option is to go withPySpark.

If you can describe your problem statement more, I will help you with the solution.

Thanks.

Amit Prajapati

bmhardy · ‎11-13-2024

Thank you Amit.

We use a 5 layer process, instead of the traditional medallion bronze, silver, gold. The materialised views are split into two functions instead of just gold, where we have 'calculated' and 'aggregated'. Once we have performed any calculations we need we than pass it over to the aggregated views to roll up a company level hierarchy. We were hoping to use a recursive CTE as a template to roll everything up in the same way each time, like the query below.

Our company hierarchy table (screenshot attached) contains the parent child relationship but the parent Id is the row Id of the parent not the Id of the parent team itself. In the screenshot all the child teams are related to the top row.

-----------------------------------------------------------------	
-- PREPARE SUBSET OF DATA
-----------------------------------------------------------------
	DECLARE
		@StartDate		int = 20240715,
		@EndDate		int = 20240721;

	DECLARE
		@Workload	TABLE(
		[TeamId]		int,
		[DateId]		int,
		[Duration]		int
	)
	
	INSERT INTO
		@Workload
	SELECT
		[w].[WorkgroupId],
		[w].[DateId],
		[w].[Duration]
	FROM
		[Workload] w
	WHERE
		[w].[Verified] = 1
		AND [w].[DateId] BETWEEN @StartDate AND @EndDate;

-----------------------------------------------------------------	
-- BUILD HIERARCHY
-----------------------------------------------------------------

	WITH [Hierarchy] (
		[TopRowId],
		[RowId],
		[TeamId],
		[TopWorkgroupId])
	AS (
		SELECT
			[ch].[Id],
			[ch].[Id],
			[ch].[TeamId],
			[ch].[TeamId]
		FROM
			[CompanyHierarchy] ch
		WHERE
			[ch].[EndDate] IS NULL
			AND ([ch].[RootNodeId] = 1 OR [ch].[TeamId] = 1)

		UNION ALL
	
		SELECT
			[h].[TopRowId],
			[wh].[Id],
			[wh].[WorkgroupId],
			[tr].[WorkgroupId]
		FROM
			[Hierarchy] h
			JOIN [CompanyHierarchy] ch
				ON [h].[RowId] = [ch].[ParentId]
			JOIN [CompanyHierarchy] tr
				ON [h].[TopRowId] = [tr].[Id]
		WHERE
			[ch].[EndDate] IS NULL
		)

-----------------------------------------------------------------	
-- JOIN DATA TO HIERARCHY AND ROLL UP
-----------------------------------------------------------------
	
	SELECT
		[h].[TopWorkgroupId],
		[t].[Name] AS [Team],
		[d].[DateId],
		SUM(CAST(ISNULL([d].[Duration] / 60.0,0.0) AS decimal(10,2))) AS [Amount accumulated]
	FROM
		[Hierarchy] h
		JOIN @Workload w
			ON [h].[WorkgroupId] = [w].[TeamId]
		JOIN [Teams] t
			ON [h].[TopWorkgroupId] = [t].[Id]
	GROUP BY
		[h].[TopWorkgroupId],
		[t].[Name],
		[w].[DateId]
	ORDER BY
		[t].[Name],
		[w].[DateId]

-----------------------------------------------------------------

-----------------------------------------------------------------

bmhardy · ‎11-19-2024

We have gone down a different route where we are using SQL for our calculated layer and then a Python notebook for our aggregated layer. It is much easier to roll up data using a Python UDF than it is trying to work out how to do it in SQL.

ZaraYoung · ‎12-04-2024

Thanks for the info.

Now, I would like to help you by sharing the https://academized.com/dissertation-writing-service site. If you are not good at writing an essay assignment then you can visit the given link to find best essay writer.