
Delta log statistics - timestamp type not working

pantelis_mare
Contributor III

Hello team!

As per the documentation, I understand that table statistics (e.g. min, max, count) can be fetched from the delta log so that the underlying data of a Delta table does not have to be read.

This is the case for numerical types, and timestamp is supposed to be supported.

In my test example though, when it comes to the timestamp column, the query plan shows that the data files are read instead of the delta log.

Any thoughts on how to make Spark read the max value of the timestamp column through the delta log?

Delta table creation:

// In a Databricks notebook spark.implicits is already in scope;
// in a standalone application you would need: import spark.implicits._
Seq(
  (1, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now())),
  (2, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(1))),
  (3, java.sql.Timestamp.valueOf(java.time.LocalDateTime.now().minusDays(2)))
)
  .toDF("value", "ts")
  .write
  .saveAsTable("test.test_ts")
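For reference, a quick way to compare the two aggregates is to look at their physical plans (this assumes the table created above and an active SparkSession named spark, as in a notebook):

// Compare the physical plans: if the max can be answered from the delta log
// statistics, the plan should not contain a full scan of the table's data files.
spark.sql("SELECT max(value) FROM test.test_ts").explain()
spark.sql("SELECT max(ts) FROM test.test_ts").explain()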

As expected, max for the value column is read from the delta log:

[screenshot: query plan for max(value) served from metadata]

but this is not the case for the ts column:

[screenshot: query plan for max(ts) reading the data files]

while the delta log contains the needed information (the max of the ts column):

{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"9f6cbb6f-e866-4279-89a3-609bacf175cb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"ts\",\"type\":\"timestamp\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1658749213255}}
{"add":{"path":"part-00000-0848b536-6c7b-4429-b168-44d53ec48ba4-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"maxValues\":{\"value\":1,\"ts\":\"2022-07-25T11:40:12.884Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-0c883bda-bfc3-4921-a46f-32eaa9fb7dbc-c000.snappy.parquet","partitionValues":{},"size":852,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"maxValues\":{\"value\":2,\"ts\":\"2022-07-24T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-54975e9b-3df8-4382-afa0-2929778ecb0a-c000.snappy.parquet","partitionValues":{},"size":853,"modificationTime":1658749217000,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"maxValues\":{\"value\":3,\"ts\":\"2022-07-23T11:40:12.885Z\"},\"nullCount\":{\"value\":0,\"ts\":0}}","tags":{"INSERTION_TIME":"1658749217000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1658749218051,"userId":"6999390541537531","userName":"p.maroudis@kaizengaming.com","operation":"CREATE TABLE AS SELECT","operationParameters":{"isManaged":"true","description":null,"partitionBy":"[]","properties":"{}"},"notebook":{"notebookId":"2851472110292823"},"clusterId":"0920-080342-who267","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"3","numOutputRows":"3","numOutputBytes":"2557"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"775ac353-9df2-45a2-99d2-1a15783c0e40"}}

1 ACCEPTED SOLUTION

After communicating with our CSE at Databricks: apparently there was an issue with truncation and precision loss that made them deactivate this feature.

I actually hit a similar issue on another project when parsing timestamp fields with json4s: it uses the SimpleDateFormat class to parse timestamps, which does not support microsecond precision. Maybe this is related.
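For what it's worth, a small illustration of that SimpleDateFormat limitation (plain JVM behaviour, not the Delta code itself):

import java.text.SimpleDateFormat

// "S" means milliseconds and parsing is greedy, so the six fractional digits
// below are read as 884123 ms (~14.7 minutes) and the instant silently shifts
// instead of keeping microsecond precision.
val fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS")
println(fmt.parse("2022-07-25T11:40:12.884123"))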

@Werner Stinckens thanks again for your help!


6 REPLIES

-werners-
Esteemed Contributor III

Are you sure the timestamp column is a valid Spark timestamp type?

Yes! As the schema shows. I also added an explicit cast to TimestampType (even though I create it as a java.sql.Timestamp anyway).

[screenshot: schema showing ts as timestamp]

Do you also get the same behavior? That's at least what happens in my test case...
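For completeness, the explicit cast looked roughly like this (shown here against the saved table rather than the original DataFrame):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

// Cast the column explicitly and confirm the schema really is timestamp.
val df = spark.table("test.test_ts").withColumn("ts", col("ts").cast(TimestampType))
df.printSchema()  // ts: timestamp (nullable = true)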

-werners-
Esteemed Contributor III

Strange.

Have you checked the actual plan (Spark UI)?

Yep, I attached it in my question. Plus it's obvious from the number of tasks it spawns: it reads the whole table (even more obvious on big tables with more partitions to read).

Can you reproduce it on your side as well or is it just us?

-werners-
Esteemed Contributor III

same here

