cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel
Contributor

I'm experimenting with liquid clustering and have some questions about compatible types  (somewhat similar to Liquid clustering with boolean columns ).

Table created as

 

CREATE TABLE IF NOT EXISTS <TABLE>
(  
    _time DOUBLE
  , timestamp TIMESTAMP_NTZ
  , aid STRING
  , aip STRING
  , cid STRING
  , TargetProcessId BIGINT
)
USING delta CLUSTER BY (timestamp,aid,TargetProcessId) LOCATION <LOCATION>

 

Table is filled via .writeStream() so I'm under the impression that data won't be clustered on write.   Therefore, I run

 

OPTIMIZE <TABLE>

 

which results in (with "spark_version": "14.2.x-cpu-ml-scala2.12"):

 

IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
	at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)

 

I was able to create the table with the CLUSTER BY command so I assumed the column list was acceptable,  but this exception indicates otherwise.   If TimestampNTZType is an unsupported datatype,  where is this documented?

When I run DESCRIBE on the table, it indicates the cluster columns were accepted:

formatpartitionColumnsclusteringColumnsnumFiles
delta[]["timestamp","aid","TargetProcessId"]382
6 REPLIES 6

Wojciech_BUK
Contributor III

Hi,

just educated guess:

There is limitation in liquid clustering docs: You can only specify columns with statistics collected for clustering keys

Perhaps it is related to data types for which you can collect statistics?

But i could not find related docs either ๐Ÿ˜ž

Yes,  I think you are correct.   When I run this,

describe extended <table> timestamp

I can see that no statistics are being calculated.    Now I will have to ask a rookie question,  is there a list of data types that do not have statistics tracked?

col_name timestamp
data_type timestamp_ntz
comment NULL
min NULL
max NULL
num_nulls NULL
distinct_count NULL
avg_col_len NULL
max_col_len NULL
histogram NULL

Running this fills up the statistics for the columns.

ANALYZE TABLE <TABLE> COMPUTE STATISTICS FOR COLUMNS timestamp,aid,ContextProcessId

But I still get the error when I run OPTIMIZE:

Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
	at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingNode(Classifier.scala:405)
	at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3(Classifier.scala:268)
	at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3$adapted(Classifier.scala:266)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

 

Sorry, cant find related docs for that limitation ๐Ÿ˜ž 

I'm not sure if this is related, but I've hit another challenge with TIMESTAMP_NTZ columns

As soon as I calculate the statistics on a TIMESTAMP_NTZ column in a table, I can't use that column in a WHERE clause date range.

This query

 

-- set the variable in advance to eliminate any cast/conversion issues in WHERE clause
DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP_NTZ DEFAULT '2023-12-04T16:00';

SELECT    event.timestamp as time
        , event.FirstIP4Record as DestIP        
        , event.DomainName as DNSName
        , dt_begin
FROM dnsrequest event
WHERE 
  timestamp >= dt_begin
--   timestamp IS NOT NULL

 

returns

 

MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
	at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:523)
	at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:400)

 

Per support,  "TimestampNTZ data skipping is not yet supported".   

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.