12-26-2023 02:31 PM - edited 12-26-2023 02:55 PM
I'm experimenting with liquid clustering and have some questions about compatible types (somewhat similar to "Liquid clustering with boolean columns").
Table created as
CREATE TABLE IF NOT EXISTS <TABLE>
(
_time DOUBLE
, timestamp TIMESTAMP_NTZ
, aid STRING
, aip STRING
, cid STRING
, TargetProcessId BIGINT
)
USING delta CLUSTER BY (timestamp,aid,TargetProcessId) LOCATION <LOCATION>
The table is filled via .writeStream(), so I'm under the impression that data won't be clustered on write. Therefore, I run
OPTIMIZE <TABLE>
which results in (with "spark_version": "14.2.x-cpu-ml-scala2.12"):
IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
I was able to create the table with the CLUSTER BY command so I assumed the column list was acceptable, but this exception indicates otherwise. If TimestampNTZType is an unsupported datatype, where is this documented?
When I run DESCRIBE on the table, it indicates the cluster columns were accepted:
format | partitionColumns | clusteringColumns | numFiles |
delta | [] | ["timestamp","aid","TargetProcessId"] | 382 |
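If TIMESTAMP_NTZ turns out to be the blocker, one possible workaround (a sketch, untested on my side) is to drop that column from the clustering keys so OPTIMIZE only clusters on supported types:

```sql
-- Sketch: redefine the clustering keys without the TIMESTAMP_NTZ column.
-- ALTER TABLE ... CLUSTER BY replaces the full key list.
ALTER TABLE <TABLE> CLUSTER BY (aid, TargetProcessId);

-- Re-run clustering; only supported column types remain as keys.
OPTIMIZE <TABLE>;
```

The ALTER itself does not rewrite existing data; subsequent OPTIMIZE runs apply the new keys incrementally.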
12-27-2023 05:42 AM
Hi,
just an educated guess:
There is a limitation in the liquid clustering docs: "You can only specify columns with statistics collected for clustering keys."
Perhaps it is related to the data types for which statistics can be collected?
But I could not find related docs either.
12-27-2023 06:16 AM
Yes, I think you are correct. When I run this,
DESCRIBE EXTENDED <table> timestamp
I can see that no statistics are being calculated. Now I will have to ask a rookie question, is there a list of data types that do not have statistics tracked?
col_name timestamp
data_type timestamp_ntz
comment NULL
min NULL
max NULL
num_nulls NULL
distinct_count NULL
avg_col_len NULL
max_col_len NULL
histogram NULL
12-27-2023 07:19 AM
Running this populates the statistics for the columns:
ANALYZE TABLE <TABLE> COMPUTE STATISTICS FOR COLUMNS timestamp, aid, TargetProcessId
But I still get the error when I run OPTIMIZE:
Unsupported datatype 'TimestampNTZType'
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: java.lang.IllegalArgumentException: Unsupported datatype 'TimestampNTZType'
at com.databricks.sql.io.skipping.Classifier$.getStatsForCol(Classifier.scala:180)
at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingBinaryNode(Classifier.scala:424)
at com.databricks.sql.io.skipping.SimpleClassifier.applyActionToIntersectingNode(Classifier.scala:405)
at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3(Classifier.scala:268)
at com.databricks.sql.io.skipping.SimpleClassifier.$anonfun$classifyForOptimize$3$adapted(Classifier.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
12-27-2023 08:01 AM
Sorry, can't find related docs for that limitation.
12-27-2023 08:37 AM - edited 12-27-2023 08:38 AM
I'm not sure if this is related, but I've hit another challenge with TIMESTAMP_NTZ columns.
As soon as I compute statistics on a TIMESTAMP_NTZ column in a table, I can no longer use that column in a WHERE clause date range.
This query
-- set the variable in advance to eliminate any cast/conversion issues in WHERE clause
DECLARE OR REPLACE VARIABLE dt_begin TIMESTAMP_NTZ DEFAULT '2023-12-04T16:00';
SELECT event.timestamp as time
, event.FirstIP4Record as DestIP
, event.DomainName as DNSName
, dt_begin
FROM dnsrequest event
WHERE
timestamp >= dt_begin
-- timestamp IS NOT NULL
returns
MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
com.databricks.backend.common.rpc.SparkDriverExceptions$SQLExecutionException: scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.evaluateBinary(FilterEstimation.scala:523)
at com.databricks.sql.optimizer.statsEstimation.FilterEstimation.calculateSingleCondition(FilterEstimation.scala:400)
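A hedged workaround sketch (my own guess, not confirmed): since the MatchError is thrown while estimating filter selectivity for TimestampNTZType, casting both sides of the comparison to a session-zone TIMESTAMP may avoid that code path. Column and variable names are the ones from the query above.

```sql
-- Sketch: compare as TIMESTAMP rather than TIMESTAMP_NTZ.
-- Caution: the cast applies session time-zone semantics, so results
-- can shift if the session zone is not UTC; verify before relying on it.
SELECT event.timestamp AS time
     , event.FirstIP4Record AS DestIP
     , event.DomainName AS DNSName
FROM dnsrequest event
WHERE CAST(event.timestamp AS TIMESTAMP) >= CAST(dt_begin AS TIMESTAMP);
```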
01-08-2024 08:08 AM
Per support, "TimestampNTZ data skipping is not yet supported".
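Given that, a pragmatic sketch (assuming session-zone semantics are acceptable for this data) is to store the column as TIMESTAMP, which does support statistics and data skipping:

```sql
-- Sketch: same table with TIMESTAMP instead of TIMESTAMP_NTZ.
-- TIMESTAMP values are normalized to UTC and displayed in the session
-- time zone, so confirm that is acceptable before migrating data.
CREATE TABLE IF NOT EXISTS <TABLE>
(
  _time DOUBLE
  , timestamp TIMESTAMP   -- was TIMESTAMP_NTZ
  , aid STRING
  , aip STRING
  , cid STRING
  , TargetProcessId BIGINT
)
USING delta CLUSTER BY (timestamp, aid, TargetProcessId) LOCATION <LOCATION>
```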