Databricks Community

hukel · ‎12-26-2023

I'm experimenting with liquid clustering and have some questions about compatible types (somewhat similar to Liquid clustering with boolean columns ).

Table created as

CREATE TABLE IF NOT EXISTS <TABLE>
(  
    _time DOUBLE
  , timestamp TIMESTAMP_NTZ
  , aid STRING
  , aip STRING
  , cid STRING
  , TargetProcessId BIGINT
)
USING delta CLUSTER BY (timestamp,aid,TargetProcessId) LOCATION <LOCATION>

Table is filled via .writeStream() so I'm under the impression that data won't be clustered on write. Therefore, I run

OPTIMIZE <TABLE>

which results in (with "spark_version": "14.2.x-cpu-ml-scala2.12"):

IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
	at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)

I was able to create the table with the CLUSTER BY command so I assumed the column list was acceptable, but this exception indicates otherwise. If TimestampNTZType is an unsupported datatype, where is this documented?

When I run DESCRIBE on the table, it indicates the cluster columns were accepted:

format	partitionColumns	clusteringColumns	numFiles
delta	[]	["timestamp","aid","TargetProcessId"]	382

Wojciech_BUK · ‎12-27-2023

Hi,

just educated guess:

There is limitation in liquid clustering docs: You can only specify columns with statistics collected for clustering keys

Perhaps it is related to data types for which you can collect statistics?

But i could not find related docs either 😞

hukel · ‎12-27-2023

Yes, I think you are correct. When I run this,

describe extended <table> timestamp

I can see that no statistics are being calculated. Now I will have to ask a rookie question, is there a list of data types that do not have statistics tracked?

col_name timestamp
data_type timestamp_ntz
comment NULL
min NULL
max NULL
num_nulls NULL
distinct_count NULL
avg_col_len NULL
max_col_len NULL
histogram NULL

hukel · ‎12-27-2023

Running this fills up the statistics for the columns.

ANALYZE TABLE <TABLE> COMPUTE STATISTICS FOR COLUMNS timestamp,aid,ContextProcessId

But I still get the error when I run OPTIMIZE:

Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
	at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
	at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingNode(Classifier.scala:405)
	at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3(Classifier.scala:268)
	at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3$adapted(Classifier.scala:266)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)

Wojciech_BUK · ‎12-27-2023

Sorry, cant find related docs for that limitation 😞

hukel · ‎12-27-2023

I'm not sure if this is related, but I've hit another challenge with TIMESTAMP_NTZ columns

As soon as I calculate the statistics on a TIMESTAMP_NTZ column in a table, I can't use that column in a WHERE clause date range.

This query

-- set the variable in advance to eliminate any cast/conversion issues in WHERE clause
DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP_NTZ DEFAULT '2023-12-04T16:00';

SELECT    event.timestamp as time
        , event.FirstIP4Record as DestIP        
        , event.DomainName as DNSName
        , dt_begin
FROM dnsrequest event
WHERE 
  timestamp >= dt_begin
--   timestamp IS NOT NULL

returns

MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
	at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:523)
	at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:400)