4 weeks ago
I have just learnt that recursive CTEs are not supported in Databricks SQL. However, we are looking to shift our complex aggregations into Databricks instead of relying on Azure SQL DB.
We are using Azure SQL DB with CDC enabled, in combination with ADF, to populate the DLTs in Databricks. From there we are trying to aggregate along two separate dimensions: by hierarchy (using a parent-child table for reference) and by period (day, week, month, year, etc.). Our data engineering team has a stronger background in SQL and more of a DBA-related past, so Python and Spark are newer concepts for us.
From my understanding, it appears that PySpark may be the only option for rolling figures up through a hierarchy. Is this correct? If so, does anyone know of any supporting material on how to achieve this?
Primarily we will use two source tables: a team hierarchy with parent-child relationships, and a transactional table with a team Id, a date, and the values we are looking to aggregate up through the hierarchy.
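For the period dimension at least, my rough understanding is that something like the sketch below would work in PySpark once the data is in Databricks. The table and column names are just placeholders for ours (a workload table with TeamId, a yyyyMMdd integer DateId, and a Duration), and I haven't tested it:

from pyspark.sql import functions as F

# Placeholder table: TeamId, DateId (yyyyMMdd int), Duration (minutes).
workload = spark.table("workload")

# Turn the integer DateId into a real date so date_trunc can be used.
with_date = workload.withColumn(
    "EventDate", F.to_date(F.col("DateId").cast("string"), "yyyyMMdd")
)

# One aggregation per grain; date_trunc snaps each date to its period start.
daily = with_date.groupBy("TeamId", "EventDate").agg(F.sum("Duration").alias("Duration"))
weekly = (
    with_date.withColumn("Week", F.date_trunc("week", "EventDate"))
    .groupBy("TeamId", "Week")
    .agg(F.sum("Duration").alias("Duration"))
)
monthly = (
    with_date.withColumn("Month", F.date_trunc("month", "EventDate"))
    .groupBy("TeamId", "Month")
    .agg(F.sum("Duration").alias("Duration"))
)

It's the hierarchy dimension we can't see a SQL-only route for.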
4 weeks ago
It's true. A month back I was also trying to solve a similar hierarchy problem and realized that recursive CTEs are not supported in Databricks SQL. For now the only option is to go with PySpark.
If you can describe your problem statement more, I will help you with the solution.
Thanks.
4 weeks ago
Thank you Amit.
We use a 5-layer process instead of the traditional medallion bronze, silver, gold. The materialised views are split into two functions instead of just gold: 'calculated' and 'aggregated'. Once we have performed any calculations we need, we then pass the data over to the aggregated views to roll it up to a company-level hierarchy. We were hoping to use a recursive CTE as a template to roll everything up in the same way each time, like the query below.
Our company hierarchy table (screenshot attached) contains the parent-child relationship, but the ParentId is the row Id of the parent row, not the Id of the parent team itself. In the screenshot, all the child teams are related to the top row.
-----------------------------------------------------------------
-- PREPARE SUBSET OF DATA
-----------------------------------------------------------------
DECLARE
    @StartDate int = 20240715,
    @EndDate int = 20240721;

DECLARE
    @Workload TABLE(
        [TeamId] int,
        [DateId] int,
        [Duration] int
    );

INSERT INTO
    @Workload
SELECT
    [w].[WorkgroupId],
    [w].[DateId],
    [w].[Duration]
FROM
    [Workload] w
WHERE
    [w].[Verified] = 1
    AND [w].[DateId] BETWEEN @StartDate AND @EndDate;
-----------------------------------------------------------------
-- BUILD HIERARCHY
-----------------------------------------------------------------
WITH [Hierarchy] (
    [TopRowId],
    [RowId],
    [TeamId],
    [TopWorkgroupId])
AS (
    -- Anchor: the active root rows of the hierarchy
    SELECT
        [ch].[Id],
        [ch].[Id],
        [ch].[TeamId],
        [ch].[TeamId]
    FROM
        [CompanyHierarchy] ch
    WHERE
        [ch].[EndDate] IS NULL
        AND ([ch].[RootNodeId] = 1 OR [ch].[TeamId] = 1)
    UNION ALL
    -- Recursive step: attach children whose ParentId is the parent's row Id
    SELECT
        [h].[TopRowId],
        [ch].[Id],
        [ch].[TeamId],
        [tr].[TeamId]
    FROM
        [Hierarchy] h
        JOIN [CompanyHierarchy] ch
            ON [h].[RowId] = [ch].[ParentId]
        JOIN [CompanyHierarchy] tr
            ON [h].[TopRowId] = [tr].[Id]
    WHERE
        [ch].[EndDate] IS NULL
)
-----------------------------------------------------------------
-- JOIN DATA TO HIERARCHY AND ROLL UP
-----------------------------------------------------------------
SELECT
    [h].[TopWorkgroupId],
    [t].[Name] AS [Team],
    [w].[DateId],
    SUM(CAST(ISNULL([w].[Duration] / 60.0, 0.0) AS decimal(10,2))) AS [Amount accumulated]
FROM
    [Hierarchy] h
    JOIN @Workload w
        ON [h].[TeamId] = [w].[TeamId]
    JOIN [Teams] t
        ON [h].[TopWorkgroupId] = [t].[Id]
GROUP BY
    [h].[TopWorkgroupId],
    [t].[Name],
    [w].[DateId]
ORDER BY
    [t].[Name],
    [w].[DateId];
-----------------------------------------------------------------
-----------------------------------------------------------------
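From what I've read so far, the recursive member of a CTE like this can be emulated in PySpark with a loop that keeps joining the most recent level back onto the hierarchy table until no new rows come back. A rough, untested sketch using the column names from the query above (anything not in the query is illustrative):

from pyspark.sql import functions as F

# Active hierarchy rows only, mirroring the EndDate IS NULL filter.
ch = spark.table("CompanyHierarchy").where(F.col("EndDate").isNull())

# Anchor member: root rows, carrying their own Id/TeamId as the "top" values.
frontier = ch.where((F.col("RootNodeId") == 1) | (F.col("TeamId") == 1)).select(
    F.col("Id").alias("TopRowId"),
    F.col("Id").alias("RowId"),
    F.col("TeamId"),
    F.col("TeamId").alias("TopWorkgroupId"),
)
hierarchy = frontier

# Recursive member: join the previous level's RowId to ParentId, repeat
# until a level comes back empty. Assumes the hierarchy has no cycles.
while True:
    children = (
        frontier.alias("h")
        .join(ch.alias("c"), F.col("h.RowId") == F.col("c.ParentId"))
        .select(
            F.col("h.TopRowId"),
            F.col("c.Id").alias("RowId"),
            F.col("c.TeamId"),
            F.col("h.TopWorkgroupId"),  # constant per root, so just carried down
        )
    )
    if children.isEmpty():  # DataFrame.isEmpty() needs Spark 3.3+
        break
    hierarchy = hierarchy.unionByName(children)
    frontier = children

# Final rollup, equivalent to the SELECT ... GROUP BY at the end of the query.
workload = spark.table("Workload").where("Verified = 1")
rollup = (
    hierarchy.alias("h")
    .join(workload.alias("w"), F.col("h.TeamId") == F.col("w.WorkgroupId"))
    .groupBy(F.col("h.TopWorkgroupId"), F.col("w.DateId"))
    .agg(F.sum(F.col("w.Duration") / 60.0).alias("AmountAccumulated"))
)

Each loop iteration triggers a job for the isEmpty() check, so presumably this suits hierarchies with a modest number of levels.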
3 weeks ago
We have gone down a different route: we are using SQL for our calculated layer and then a Python notebook for our aggregated layer. It is much easier to roll up data using a Python UDF than to work out how to do it in SQL.
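For anyone who finds this thread later: a related pattern that avoids looping over DataFrame joins entirely is to collect the (usually small) hierarchy table to the driver, walk the parent pointers in plain Python to build an ancestor-descendant mapping, and let Spark do the heavy aggregation against it. A rough illustrative sketch, not our production code (all names assumed):

from pyspark.sql import functions as F

# The hierarchy table is tiny relative to the fact table, so collecting
# it to the driver is cheap; the big join/aggregate still runs in Spark.
rows = (
    spark.table("CompanyHierarchy")
    .where(F.col("EndDate").isNull())
    .select("Id", "ParentId", "TeamId")
    .collect()
)
parent_of = {r["Id"]: r["ParentId"] for r in rows}
team_of = {r["Id"]: r["TeamId"] for r in rows}

# Walk up from every row to list (ancestor team, descendant team) pairs,
# including the row itself. Assumes no cycles in the hierarchy.
pairs = []
for row_id in parent_of:
    node = row_id
    while node in team_of:  # stops when we walk past the root
        pairs.append((team_of[node], team_of[row_id]))
        node = parent_of[node]

closure = spark.createDataFrame(pairs, ["AncestorTeamId", "TeamId"])

# Each transaction now counts once at every level above it, so a plain
# groupBy/sum produces rolled-up figures for all teams at once.
workload = (
    spark.table("Workload")
    .where("Verified = 1")
    .withColumnRenamed("WorkgroupId", "TeamId")
)
rollup = (
    closure.join(workload, "TeamId")
    .groupBy("AncestorTeamId", "DateId")
    .agg(F.sum("Duration").alias("Duration"))
)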
Wednesday
Thanks for the info.