โ09-17-2024 08:23 PM
Maintaining accurate and up-to-date calendar date tables is crucial for reliable reporting, yet manual updates can be time-consuming and prone to error. This fundamental component serves as the backbone for date-based analysis, enabling accurate reporting across various date dimensions. Whether you are tracking sales trends, analysing employee performance, or forecasting future metrics, a robust calendar dates table is essential for deriving meaningful insights from your data.
Databricks, with its SQL engine, offers an ideal environment for creating and maintaining such a crucial element. By leveraging Databricksโ capabilities, organisations can automate the process of generating and updating their calendar dates table.
In this blog post, we will explore how to harness Databricksโ serverless SQL and workflow (scheduling) functionalities to build and refresh a calendar dates table tailored for Business Intelligence applications.
We will cover the creation of the table itself, setting up a workflow. By the end, you will have a clear understanding of how to implement this solution.
Previously, we used to leverage a Python-driven notebook to build and refresh our data. We decided to now leverage SQL instead of Python as most of our customers are more at ease with the latter.
First things first, we need to define the start and end date of our table. In order to do so, we will leverage the SQL function sequence. Long story short, the sequence function is used to create an array of elements from start to stop and incrementing by a step. The below code will generate an array containing dates from January 1st 2020, to January 1st 2050.
Now the second step is to literally explode that array into proper rows. In order to do so, we will leverage the SQL function explode. Basically, this function will return a set of rows by un-nesting our array (collection).
In two very easy and simple steps we now have the foundation of our date dimension.
Next steps is to capitalise on those two steps and start creating various date related columns based on the value of the calendar_date column. To do so, we will be using a CTE (Common table Expression) definition to factorise and reuse our code in a neat manner.
Now comes the most interesting part which is defining and adding the Business Intelligence relevant date attributes. The below query enriches our table with additional attributes useful for BI reporting, including fiscal periods (assuming a fiscal year starting April 1).
Many of the built-in SQL functions available in Databricks will look very familiar to experienced data engineers who have worked with on-premises relational database management systems (RDBMS) like Oracle, SQL Server, or PostgreSQL. This similarity eases the transition for those moving from traditional data warehousing to Databricks.
For instance, in our calendar dimension example, functions such as year(), month(), dayofweek(), and weekofyear() are nearly identical in syntax and functionality to their counterparts in traditional RDBMS systems. This allows data engineers to leverage their existing SQL knowledge when building date-related dimensions and fact tables in Databricks.
with calendarDates as ( select explode(array_dates) as calendar_date from ( select sequence( make_date(2020, 01, 01), make_date(2050, 01, 01), interval 1 day ) as array_dates ) ) select 10000 * year(calendar_date) + 100 * month(calendar_date) + day(calendar_date) as calendar_date_key, to_date(calendar_date) as calendar_date, year(calendar_date) calendar_year, month(calendar_date) calendar_month, day(calendar_date) calendar_day_of_month, date_format(calendar_date, 'EEEE MMMM dd yyyy') calendar_date_full, date_format(calendar_date, 'EEEE') calendar_day_name, date_add(to_date(calendar_date), -1) as calendar_date_previous_day, date_add(to_date(calendar_date), 1) as calendar_date_next_day, case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end as calendar_week_start, date_add( case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end, 6 ) as calendar_week_end, weekday(calendar_date) + 1 as calendar_week_day, weekofyear(calendar_date) calendar_week_of_year, date_format(calendar_date, 'MMMM yyyy') calendar_month_year, date_format(calendar_date, 'MMMM') calendar_month_name, date_add(last_day(add_months(calendar_date, -1)), 1) calendar_first_day_of_month, last_day(calendar_date) calendar_last_day_of_month, case when month(calendar_date) in (1, 2, 3) then 1 when month(calendar_date) in (4, 5, 6) then 2 when month(calendar_date) in (7, 8, 9) then 3 else 4 end AS fiscal_quarter, year(date_add(calendar_date, 89)) AS fiscal_year, case when to_date(now()) = calendar_date then true else false end as current_day, CASE WHEN to_date(now()) BETWEEN ( case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end ) AND ( date_add( case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end, 6 ) ) THEN true else false end as current_week, case when month(to_date(now())) = month(calendar_date) and year(to_date(now())) = year(calendar_date) then true else false end as current_month, case when year(to_date(now())) = year(calendar_date) then true else false end as current_year from calendarDates limit 10;
Now that we defined our SQL, we need to encapsulate it inside a Create Table statement, then we will parametrise our notebook to pass 2 parameters (widgets) defining which catalog and schema to use.
We will put that in motion directly in a Notebook as per the below:
The full notebook can be accessed here:
-- first let's set the timezone to the correct location. UTC is behind NZST or NZDT. SET timezone = Pacific/Auckland; -- let's use the right catalog and schema passed as parameters use catalog ${catalog_name}; use schema ${schema_name}; -- create a managed table in our Unity Catalog create or replace table calendar as -- CTE to simplify our SQL with calendarDates as ( select explode(array_dates) as calendar_date from ( select sequence( make_date(2020, 01, 01), -- start date make_date(2050, 01, 01), -- end date interval 1 day -- incremental step ) as array_dates ) ) -- the SQL transforming our main calendar_date into all the colunmns we will be requiring select 10000 * year(calendar_date) + 100 * month(calendar_date) + day(calendar_date) as calendar_date_key, -- calendar table primary key to_date(calendar_date) as calendar_date, year(calendar_date) calendar_year, month(calendar_date) calendar_month, day(calendar_date) calendar_day_of_month, date_format(calendar_date, 'EEEE MMMM dd yyyy') calendar_date_full, date_format(calendar_date, 'EEEE') calendar_day_name, date_add(to_date(calendar_date), -1) as calendar_date_previous_day, date_add(to_date(calendar_date), 1) as calendar_date_next_day, case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end as calendar_week_start, -- start of week date date_add( case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end, 6 ) as calendar_week_end, -- end of week date weekday(calendar_date) + 1 as calendar_week_day, weekofyear(calendar_date) calendar_week_of_year, date_format(calendar_date, 'MMMM yyyy') calendar_month_year, date_format(calendar_date, 'MMMM') calendar_month_name, date_add(last_day(add_months(calendar_date, -1)), 1) calendar_first_day_of_month, last_day(calendar_date) calendar_last_day_of_month, case when month(calendar_date) in (1, 2, 3) then 1 when month(calendar_date) in (4, 5, 6) then 2 when month(calendar_date) in (7, 8, 9) then 3 else 4 end AS fiscal_quarter, year(date_add(calendar_date, 89)) AS fiscal_year, case when to_date(now()) = calendar_date then true else false end as current_day, -- true if calendar day is today CASE WHEN to_date(now()) BETWEEN ( case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end ) AND ( date_add( case when date_add(calendar_date, (weekday(calendar_date) + 1) -1) = calendar_date then to_date(calendar_date) else date_add(calendar_date, -(weekday(calendar_date))) end, 6 ) ) THEN true else false end as current_week, -- true if calendar day is in current week case when month(to_date(now())) = month(calendar_date) and year(to_date(now())) = year(calendar_date) then true else false end as current_month, -- true if calendar day is in current month case when year(to_date(now())) = year(calendar_date) then true else false end as current_year -- true is calendar days is in current year from calendarDates
-- modify our calendar table in order to set a PK (primary key) ALTER TABLE calendar ALTER COLUMN calendar_date_key SET NOT NULL; alter table calendar add primary key (calendar_date_key);
โ09-17-2024 08:29 PM
Very helpful, thanks for sharing!
โ09-17-2024 08:29 PM
Thanks, automating tasks like these can be a life saver ๐
โ09-17-2024 08:36 PM
Nice article, very informative!
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group