Unsupported datatype 'TimestampNTZType' with liquid clustering
12-26-2023 02:31 PM - edited 12-26-2023 02:55 PM
I'm experimenting with liquid clustering and have some questions about compatible types (somewhat similar to Liquid clustering with boolean columns).
The table was created as:
CREATE TABLE IF NOT EXISTS <TABLE>
(
_time DOUBLE
, timestamp TIMESTAMP_NTZ
, aid STRING
, aip STRING
, cid STRING
, TargetProcessId BIGINT
)
USING delta CLUSTER BY (timestamp,aid,TargetProcessId) LOCATION <LOCATION>
The table is filled via .writeStream(), so I'm under the impression that data won't be clustered on write. Therefore, I run:
OPTIMIZE <TABLE>
which results in (with "spark_version": "14.2.x-cpu-ml-scala2.12"):
IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
I was able to create the table with the CLUSTER BY clause, so I assumed the column list was acceptable, but this exception indicates otherwise. If TimestampNTZType is an unsupported datatype, where is this documented?
When I run DESCRIBE on the table, it indicates the cluster columns were accepted:
format | partitionColumns | clusteringColumns | numFiles |
delta | [] | ["timestamp","aid","TargetProcessId"] | 382 |
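One option I'm considering as a workaround (untested, and assuming the TIMESTAMP_NTZ column itself is the problem) is to drop that column from the clustering keys so OPTIMIZE can proceed:

```sql
-- Untested sketch: remove the TIMESTAMP_NTZ column from the clustering keys,
-- keeping the remaining two; <TABLE> is a placeholder as above.
ALTER TABLE <TABLE> CLUSTER BY (aid, TargetProcessId);

OPTIMIZE <TABLE>;
```

That loses clustering on the timestamp, of course, which defeats part of the purpose.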
12-27-2023 05:42 AM
Hi,
just an educated guess:
The liquid clustering docs state a limitation: "You can only specify columns with statistics collected for clustering keys."
Perhaps it is related to the data types for which statistics can be collected?
But I could not find related docs either 😞
12-27-2023 06:16 AM
Yes, I think you are correct. When I run this:
describe extended <table> timestamp
I can see that no statistics are being calculated. Now I will have to ask a rookie question, is there a list of data types that do not have statistics tracked?
col_name        timestamp
data_type       timestamp_ntz
comment         NULL
min             NULL
max             NULL
num_nulls       NULL
distinct_count  NULL
avg_col_len     NULL
max_col_len     NULL
histogram       NULL
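(Side note, and this is my assumption rather than anything I found documented: the statistics shown by DESCRIBE EXTENDED come from the catalog via ANALYZE TABLE, which I believe are separate from the per-file min/max stats Delta collects for data skipping. Delta's per-file stats collection is controlled by a table property:)

```sql
-- Assumption: Delta collects per-file skipping stats for the first N columns
-- of the table (default 32, I believe); this property controls N.
ALTER TABLE <TABLE> SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '6');
```

So even if ANALYZE populates catalog stats, it may not affect what OPTIMIZE sees.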
12-27-2023 07:19 AM
Running this fills in the statistics for the columns:
ANALYZE TABLE <TABLE> COMPUTE STATISTICS FOR COLUMNS timestamp,aid,TargetProcessId
But I still get the error when I run OPTIMIZE:
Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingNode(Classifier.scala:405)
at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3(Classifier.scala:268)
at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3$adapted(Classifier.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
12-27-2023 08:01 AM
Sorry, I can't find related docs for that limitation 😞
12-27-2023 08:37 AM - edited 12-27-2023 08:38 AM
I'm not sure if this is related, but I've hit another challenge with TIMESTAMP_NTZ columns.
As soon as I calculate statistics on a TIMESTAMP_NTZ column in a table, I can't use that column in a WHERE clause date range.
This query:
-- set the variable in advance to eliminate any cast/conversion issues in WHERE clause
DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP_NTZ DEFAULT '2023-12-04T16:00';
SELECT event.timestamp as time
, event.FirstIP4Record as DestIP
, event.DomainName as DNSName
, dt_begin
FROM dnsrequest event
WHERE
timestamp >= dt_begin
-- timestamp IS NOT NULL
returns:
MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:523)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:400)
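Since the failure is inside FilterEstimation, which I believe belongs to the cost-based optimizer, my guess at a session-level workaround (untested) is to disable cost-based estimation so the filter is never run through it:

```sql
-- Guess: FilterEstimation is part of CBO, so disabling the cost-based
-- optimizer for the session may let the WHERE clause run.
SET spark.sql.cbo.enabled = false;
```

If that works, it would at least confirm the stats on the TIMESTAMP_NTZ column are what triggers the MatchError.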
01-08-2024 08:08 AM
Per support, "TimestampNTZ data skipping is not yet supported".
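Given that answer, the workaround I'm going with (my own sketch, not something support recommended) is to store the column as plain TIMESTAMP, which does appear to support data skipping, and convert at read time where NTZ semantics matter:

```sql
-- Sketch under the assumption that TIMESTAMP columns support data skipping
-- and are valid clustering keys; <TABLE> and <LOCATION> are placeholders.
CREATE TABLE IF NOT EXISTS <TABLE>
(
  _time DOUBLE
  , timestamp TIMESTAMP      -- was TIMESTAMP_NTZ
  , aid STRING
  , aip STRING
  , cid STRING
  , TargetProcessId BIGINT
)
USING delta CLUSTER BY (timestamp, aid, TargetProcessId) LOCATION <LOCATION>;
```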