topic Unsupported datatype 'TimestampNTZType' with liquid clustering in Data Engineering

Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel — Tue, 26 Dec 2023 22:55:57 GMT

I'm experimenting with liquid clustering and have some questions about compatible types (somewhat similar to ).

Table created as

CREATE TABLE IF NOT EXISTS <TABLE> ( _time DOUBLE , timestamp TIMESTAMP_NTZ , aid STRING , aip STRING , cid STRING , TargetProcessId BIGINT ) USING delta CLUSTER BY (timestamp,aid,TargetProcessId) LOCATION <LOCATION>

Table is filled via .writeStream() so I'm under the impression that data won't be clustered on write. Therefore, I run

OPTIMIZE <TABLE>

which results in (with "spark_version": "14.2.x-cpu-ml-scala2.12"):

IllegalArgumentException: Unsupported datatype 'TimestampNTZType' com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType' at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180) at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)

I was able to create the table with the CLUSTER BY command so I assumed the column list was acceptable, but this exception indicates otherwise. If TimestampNTZType is an unsupported datatype, where is this documented?

When I run DESCRIBE on the table, it indicates the cluster columns were accepted:

format	partitionColumns	clusteringColumns	numFiles
delta	[]	["timestamp","aid","TargetProcessId"]	382

Re: Unsupported datatype 'TimestampNTZType' with liquid clustering

Wojciech_BUK — Wed, 27 Dec 2023 13:42:57 GMT

Hi,

just educated guess:

There is limitation in liquid clustering docs: You can only specify columns with statistics collected for clustering keys

Perhaps it is related to data types for which you can collect statistics?

But i could not find related docs either 😞

Re: Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel — Wed, 27 Dec 2023 14:16:27 GMT

Yes, I think you are correct. When I run this,

describe extended <table> timestamp

I can see that no statistics are being calculated. Now I will have to ask a rookie question, is there a list of data types that do not have statistics tracked?

col_name timestamp
data_type timestamp_ntz
comment NULL
min NULL
max NULL
num_nulls NULL
distinct_count NULL
avg_col_len NULL
max_col_len NULL
histogram NULL

Re: Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel — Wed, 27 Dec 2023 15:19:33 GMT

Running this fills up the statistics for the columns.

ANALYZE TABLE <TABLE> COMPUTE STATISTICS FOR COLUMNS timestamp,aid,ContextProcessId

But I still get the error when I run OPTIMIZE:

Unsupported datatype 'TimestampNTZType' com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType' at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180) at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424) at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingNode(Classifier.scala:405) at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3(Classifier.scala:268) at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3$adapted(Classifier.scala:266) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

Re: Unsupported datatype 'TimestampNTZType' with liquid clustering

Wojciech_BUK — Wed, 27 Dec 2023 16:01:26 GMT

Sorry, cant find related docs for that limitation 😞

Re: Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel — Wed, 27 Dec 2023 16:38:00 GMT

I'm not sure if this is related, but I've hit another challenge with TIMESTAMP_NTZ columns

As soon as I calculate the statistics on a TIMESTAMP_NTZ column in a table, I can't use that column in a WHERE clause date range.

This query

-- set the variable in advance to eliminate any cast/conversion issues in WHERE clause DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP_NTZ DEFAULT '2023-12-04T16:00'; SELECT event.timestamp as time , event.FirstIP4Record as DestIP , event.DomainName as DNSName , dt_begin FROM dnsrequest event WHERE timestamp >= dt_begin -- timestamp IS NOT NULL

returns

MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$) com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$) at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:523) at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:400)

Re: Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel — Mon, 08 Jan 2024 16:08:55 GMT

Per support, "TimestampNTZ data skipping is not yet supported".