by Gengliang Wang, Serge Rielau, Wenchen Fan, and Ivan Sadikov
As the volume and variety of data grow, so does the need for refined data types to ensure consistency, accuracy, and efficiency. Time data is notoriously challenging due to the complexities associated with time zones. Apache Spark, in its pursuit to be the de-facto processing engine for big data, has recognized this challenge. In Spark 3.4, the community introduces the TIMESTAMP_NTZ type, a timestamp that operates without considering time zones. This blog post delves into the TIMESTAMP_NTZ type, its motivations, and how to make the most of it on Apache Spark and Databricks platform.
Before the Apache Spark 3.4 release, the default and only timestamp type behaved similarly to Oracle's TIMESTAMP WITH LOCAL TIME ZONE, which is stored via normalizing to a standard time zone (UTC). However, Operating in multi-timezone environments or transferring data across systems presents unique challenges with timestamps. In these contexts, querying tables becomes intricate. Filtering based on timestamps requires not just an understanding of the data, but also an awareness of the source and target time zones. This added layer of complexity can make queries prone to errors and harder to maintain.
Hence, there's a clear need for a timestamp type that remains consistent, and independent of the session's time zone. Enter TIMESTAMP_NTZ, which stands for "timestamp without time zone". This data type allows users to represent precise moments without the nuances of time zones.
Getting Started with TIMESTAMP_NTZ
Creating and manipulating tables/literals with the TIMESTAMP_NTZ type is simple. Here are some examples:
CREATE TABLE t(ts TIMESTAMP_NTZ) using parquet; ALTER TABLE t ADD COLUMN (ts2 TIMESTAMP_NTZ); SELECT CAST('2021-07-07 00:00:00' AS TIMESTAMP_NTZ); SELECT TIMESTAMP_NTZ'2021-07-07 00:00:00';
A standout feature introduced in Spark 3.4 is the convert_timezone() function. This function aids users in converting between time zones for TIMESTAMP_NTZ:
-- Convert from 'America/Los_Angeles' to 'UTC' SELECT convert_timezone('America/Los_Angeles', 'UTC', TIMESTAMP_NTZ'2021-12-06 00:00:00'); -- Output: 2021-12-06 08:00:00
TIMESTAMP_NTZ vs. TIMESTAMP
Given the following table recorded_times:
SET TIME ZONE 'America/Los_Angeles'; CREATE TABLE recorded_times (ts TIMESTAMP, ts_ntz TIMESTAMP_NTZ) USING parquet; INSERT INTO recorded_times VALUES (now(), now());
In this 'America/Los_Angeles' time zone, both TIMESTAMP and TIMESTAMP_NTZ capture the same time, say 2023-10-05 17:00:00.
However, when we switch our session to another time zone:
SET TIME ZONE 'America/New_York'; SELECT * FROM recorded_times;
The TIMESTAMP value adjusts to reflect the new time zone, while TIMESTAMP_NTZ remains consistent:
In 'America/Los_Angeles' Time Zone
In 'America/New_York' Time Zone
Interoperability and Support
The TIMESTAMP_NTZ type offers seamless conversion with Date, Timestamp, and String types. It's supported across Python, SQL, Scala, and Java in Spark. Moreover, it accommodates file sources such as Delta, Parquet, ORC, Avro, JSON, and CSV and ensures compatibility with Hive metastore and Unity Catalog.
The introduction of the TIMESTAMP_NTZ feature in Spark SQL highlights Spark's commitment to addressing the evolving needs of its users. If you're grappling with timestamp data and want to sidestep the complications of time zones, TIMESTAMP_NTZ is your go-to feature.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.