Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unsupported datatype 'TimestampNTZType' with liquid clustering

hukel
Contributor

I'm experimenting with liquid clustering and have some questions about compatible types (somewhat similar to the thread Liquid clustering with boolean columns).

Table created as

 

CREATE TABLE IF NOT EXISTS <TABLE>
(  
    _time DOUBLE
  , timestamp TIMESTAMP_NTZ
  , aid STRING
  , aip STRING
  , cid STRING
  , TargetProcessId BIGINT
)
USING delta
CLUSTER BY (timestamp, aid, TargetProcessId)
LOCATION <LOCATION>

 

The table is filled via .writeStream(), so I'm under the impression that data won't be clustered on write. Therefore, I run

 

OPTIMIZE <TABLE>

 

which results in (with "spark_version": "14.2.x-cpu-ml-scala2.12"):

 

IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
	at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)

 

I was able to create the table with the CLUSTER BY clause, so I assumed the column list was acceptable, but this exception indicates otherwise. If TimestampNTZType is an unsupported datatype, where is that documented?

When I run DESCRIBE on the table, it indicates the cluster columns were accepted:

format  partitionColumns  clusteringColumns                       numFiles
delta   []                ["timestamp","aid","TargetProcessId"]   382
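If the clustering key itself is the blocker, one possible workaround (a sketch, not verified on DBR 14.2) is to drop the TIMESTAMP_NTZ column from the clustering keys; ALTER TABLE ... CLUSTER BY replaces the key list on an existing liquid-clustered table:

```sql
-- Sketch: re-cluster on the supported columns only.
-- Subsequent OPTIMIZE runs will cluster by the new keys.
ALTER TABLE <TABLE> CLUSTER BY (aid, TargetProcessId);
OPTIMIZE <TABLE>;
```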
6 REPLIES

Wojciech_BUK
Valued Contributor III

Hi,

Just an educated guess: the liquid clustering docs list this limitation: "You can only specify columns with statistics collected for clustering keys."

Perhaps it is related to the data types for which statistics can be collected?

But I could not find the related docs either 😞

hukel
Contributor

Yes, I think you are correct. When I run

describe extended <table> timestamp

I can see that no statistics are being calculated. Now I have to ask a rookie question: is there a list of data types for which statistics are not tracked?

col_name timestamp
data_type timestamp_ntz
comment NULL
min NULL
max NULL
num_nulls NULL
distinct_count NULL
avg_col_len NULL
max_col_len NULL
histogram NULL
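As a related knob (my assumption here, based on the Delta data-skipping table properties): which columns get file-level statistics is configurable, so you can at least pin the stats columns to the supported ones explicitly:

```sql
-- Delta collects file-level stats for the first 32 columns by default
-- (delta.dataSkippingNumIndexedCols); on DBR 13.3+ you can name them instead.
-- Listing only the supported clustering columns here, since the NTZ
-- timestamp column does not get min/max stats anyway.
ALTER TABLE <TABLE> SET TBLPROPERTIES (
  'delta.dataSkippingStatsColumns' = 'aid,TargetProcessId'
);
```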

Running the following fills in the statistics for the columns:

ANALYZE TABLE <TABLE> COMPUTE STATISTICS FOR COLUMNS timestamp, aid, TargetProcessId

But I still get the error when I run OPTIMIZE:

Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
	at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingNode(Classifier.scala:405)
	at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3(Classifier.scala:268)
	at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3$adapted(Classifier.scala:266)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

 

Wojciech_BUK
Valued Contributor III

Sorry, I can't find related docs for that limitation 😞

hukel
Contributor

I'm not sure if this is related, but I've hit another challenge with TIMESTAMP_NTZ columns.

As soon as I compute statistics on a TIMESTAMP_NTZ column, I can no longer use that column in a WHERE-clause date range.

This query

 

-- set the variable in advance to eliminate any cast/conversion issues in WHERE clause
DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP_NTZ DEFAULT '2023-12-04T16:00';

SELECT    event.timestamp as time
        , event.FirstIP4Record as DestIP        
        , event.DomainName as DNSName
        , dt_begin
FROM dnsrequest event
WHERE 
  timestamp >= dt_begin
--   timestamp IS NOT NULL

 

returns

 

MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
	at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:523)
	at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:400)

 

Per support, "TimestampNTZ data skipping is not yet supported."
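Until that support lands, a possible workaround (untested, so treat it as a sketch) is to keep the comparison in a supported type so the filter estimator never sees TimestampNTZType, e.g. by casting in the predicate:

```sql
-- Sketch: declare the variable as TIMESTAMP and cast the column to match,
-- so statistics estimation operates on a supported type.
DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP DEFAULT '2023-12-04T16:00';

SELECT event.timestamp AS time
FROM dnsrequest event
WHERE CAST(event.timestamp AS TIMESTAMP) >= dt_begin;
```

Note the cast on the column may also prevent any data skipping on it, but it should at least let the query run.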
